Round 1: 49 Iterations
********** Iteration 49 ************
Eval num_timesteps=50176, episode_reward=0.04 +/- 0.00
Total episodes ran=100
Optimizing...
pol_surr | pol_entpen | vf_loss | kl | ent
1.68e-08 | -0.18878 | 0.59067 | 3.35e-10 | 1.88780
-0.00044 | -0.18819 | 0.59072 | 0.00017 | 1.88193
-0.00106 | -0.18767 | 0.59025 | 0.00056 | 1.87667
-0.00191 | -0.18721 | 0.58957 | 0.00107 | 1.87212
Evaluating losses...
-0.00300 | -0.18683 | 0.58875 | 0.00161 | 1.86834
-------------------------------------------
| EpLenMean | 9.51 |
| EpRewMean | 0.12 |
| EpThisIter | 107 |
| EpisodesSoFar | 5544 |
| TimeElapsed | 353 |
| TimestepsSoFar | 51200 |
| ev_tdlam_before | 0.00893 |
| loss_ent | 1.8683362 |
| loss_kl | 0.0016059591 |
| loss_pol_entpen | -0.18683362 |
| loss_pol_surr | -0.0030006962 |
| loss_vf_loss | 0.5887544 |
---------------------------------------------
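Policy Surrogate Loss (pol_surr): It starts at essentially zero (1.68e-08) on the first optimization pass, when the updated policy has barely moved from the one that collected the data, and falls to about -0.003 over the following passes. PPO minimizes this loss, so the negative trend means the update is raising the probability of actions with positive advantage, although the steps are small.
Policy Entropy Penalty (pol_entpen): This value is relatively stable and rises slightly, from -0.18878 to -0.18683. It stays at -0.1 times the entropy throughout, so its magnitude shrinks only because the policy becomes slightly less random.
Value Function Loss (vf_loss): The loss decreases slightly, from 0.59067 to 0.58875, suggesting the value function is still adjusting to predict returns more accurately. The explained variance before the update (ev_tdlam_before = 0.00893) is close to zero, which suggests the value predictions still explain very little of the variation in returns.
KL Divergence (kl): It starts near zero (3.35e-10) and grows to about 0.0016 by the end of the iteration. It remains small, indicating that the policy updates stay close to the previous policy and within PPO’s constraints.
Entropy (ent): Entropy decreases steadily, from 1.88780 to 1.86834, reflecting a gradual shift from exploration toward exploitation. The drop within this iteration is about 1%, so the policy is still far from deterministic.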
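To illustrate why the surrogate loss starts at zero and then turns negative, here is a minimal sketch of a PPO-style clipped surrogate in plain Python. It is not the training code used for this run: the function names, the clip range of 0.2, the toy advantages, and the advantage normalization are all assumptions made for illustration.

```python
import math

def normalize_advantages(advantages):
    """Shift and scale advantages to zero mean and unit standard deviation (assumed preprocessing)."""
    mean = sum(advantages) / len(advantages)
    std = math.sqrt(sum((a - mean) ** 2 for a in advantages) / len(advantages))
    return [(a - mean) / std for a in advantages]

def clipped_surrogate_loss(new_logp, old_logp, advantages, clip=0.2):
    """Negated PPO clipped surrogate, so that lower values are better.

    clip=0.2 is a hypothetical value, not one read from the log.
    """
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip, min(1.0 + clip, ratio))
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)

# Toy data: four sampled actions, each taken with probability 0.2 under the old policy.
adv = normalize_advantages([1.0, -0.5, 2.0, 0.0])
old_logp = [math.log(0.2)] * 4

# Before any gradient step the two policies coincide, so the loss is about 0.
print(clipped_surrogate_loss(old_logp, old_logp, adv))

# Nudging probability toward positive-advantage actions makes the loss negative.
new_logp = [lp + (0.05 if a > 0 else -0.05) for lp, a in zip(old_logp, adv)]
print(clipped_surrogate_loss(new_logp, old_logp, adv))
```

With zero-mean advantages and a probability ratio of 1, the terms cancel, which matches the near-zero first row of the log. Later rows are negative because the policy has moved toward better-than-average actions.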
In conclusion, the agent is improving its policy in small steps (pol_surr decreasing), maintaining sufficient exploration (ent falling gradually rather than collapsing early), conforming to PPO’s constraints (kl remains small), and improving its value predictions only incrementally (vf_loss falls slightly).
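The relationship between the entropy penalty and the entropy can be checked directly from the logged numbers. The sketch below copies the five (pol_entpen, ent) pairs from the iteration 49 log, recovers the implied entropy coefficient, and compares the final entropy with the maximum possible for a uniform policy. The action count of 7 (one per Connect4 column) is an assumption; the log does not state the size of the action space.

```python
import math

# (pol_entpen, ent) pairs copied from the iteration 49 log
rows = [
    (-0.18878, 1.88780),
    (-0.18819, 1.88193),
    (-0.18767, 1.87667),
    (-0.18721, 1.87212),
    (-0.18683, 1.86834),
]

# If the penalty has the form -c * entropy, then c = -pol_entpen / ent.
coefs = [-pen / ent for pen, ent in rows]
print("implied entropy coefficients:", [round(c, 4) for c in coefs])

# Assumption: 7 discrete actions, one per Connect4 column.
N_ACTIONS = 7
max_entropy = math.log(N_ACTIONS)  # entropy of a uniform policy, in nats

final_entropy = rows[-1][1]
print("max entropy:", round(max_entropy, 4))
print("final entropy as a fraction of max:", round(final_entropy / max_entropy, 3))
```

Under that assumption, an entropy this close to the maximum would mean the policy is still nearly uniform over columns, which fits the reading that exploration has not collapsed.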
Playing 100 games...
Played 1 games: {'best_model_qnzjl': -1, 'base_jfyww': 1}
Played 2 games: {'best_model_qnzjl': -2, 'base_jfyww': 2}
Played 3 games: {'best_model_qnzjl': -3, 'base_jfyww': 3}
Played 4 games: {'best_model_qnzjl': -4, 'base_jfyww': 4}
Played 5 games: {'best_model_qnzjl': -3, 'base_jfyww': 3}
Played 6 games: {'best_model_qnzjl': -2, 'base_jfyww': 2}
Played 7 games: {'best_model_qnzjl': -1, 'base_jfyww': 1}
Played 8 games: {'best_model_qnzjl': 0, 'base_jfyww': 0}
Played 9 games: {'best_model_qnzjl': 1, 'base_jfyww': -1}
Played 10 games: {'best_model_qnzjl': 0, 'base_jfyww': 0}
Played 11 games: {'best_model_qnzjl': 1, 'base_jfyww': -1}
Played 12 games: {'best_model_qnzjl': 0, 'base_jfyww': 0}
Played 13 games: {'best_model_qnzjl': 1, 'base_jfyww': -1}
Played 14 games: {'best_model_qnzjl': 2, 'base_jfyww': -2}
Played 15 games: {'best_model_qnzjl': 1, 'base_jfyww': -1}
Played 16 games: {'best_model_qnzjl': 0, 'base_jfyww': 0}
Played 17 games: {'best_model_qnzjl': -1, 'base_jfyww': 1}
Played 18 games: {'best_model_qnzjl': -2, 'base_jfyww': 2}
Played 19 games: {'best_model_qnzjl': -3, 'base_jfyww': 3}
Played 20 games: {'best_model_qnzjl': -2, 'base_jfyww': 2}
Played 21 games: {'best_model_qnzjl': -1, 'base_jfyww': 1}
Played 22 games: {'best_model_qnzjl': 0, 'base_jfyww': 0}
Played 23 games: {'best_model_qnzjl': 1, 'base_jfyww': -1}
Played 24 games: {'best_model_qnzjl': 2, 'base_jfyww': -2}
Played 25 games: {'best_model_qnzjl': 3, 'base_jfyww': -3}
Played 26 games: {'best_model_qnzjl': 4, 'base_jfyww': -4}
Played 27 games: {'best_model_qnzjl': 5, 'base_jfyww': -5}
Played 28 games: {'best_model_qnzjl': 4, 'base_jfyww': -4}
Played 29 games: {'best_model_qnzjl': 5, 'base_jfyww': -5}
Played 30 games: {'best_model_qnzjl': 6, 'base_jfyww': -6}
Played 31 games: {'best_model_qnzjl': 5, 'base_jfyww': -5}
Played 32 games: {'best_model_qnzjl': 6, 'base_jfyww': -6}
Played 33 games: {'best_model_qnzjl': 7, 'base_jfyww': -7}
Played 34 games: {'best_model_qnzjl': 8, 'base_jfyww': -8}
Played 35 games: {'best_model_qnzjl': 7, 'base_jfyww': -7}
Played 36 games: {'best_model_qnzjl': 8, 'base_jfyww': -8}
Played 37 games: {'best_model_qnzjl': 7, 'base_jfyww': -7}
Played 38 games: {'best_model_qnzjl': 8, 'base_jfyww': -8}
Played 39 games: {'best_model_qnzjl': 9, 'base_jfyww': -9}
Played 40 games: {'best_model_qnzjl': 10, 'base_jfyww': -10}
Played 41 games: {'best_model_qnzjl': 9, 'base_jfyww': -9}
Played 42 games: {'best_model_qnzjl': 10, 'base_jfyww': -10}
Played 43 games: {'best_model_qnzjl': 11, 'base_jfyww': -11}
Played 44 games: {'best_model_qnzjl': 12, 'base_jfyww': -12}
Played 45 games: {'best_model_qnzjl': 13, 'base_jfyww': -13}
Played 46 games: {'best_model_qnzjl': 14, 'base_jfyww': -14}
Played 47 games: {'best_model_qnzjl': 15, 'base_jfyww': -15}
Played 48 games: {'best_model_qnzjl': 14, 'base_jfyww': -14}
Played 49 games: {'best_model_qnzjl': 13, 'base_jfyww': -13}
Played 50 games: {'best_model_qnzjl': 14, 'base_jfyww': -14}
Played 51 games: {'best_model_qnzjl': 15, 'base_jfyww': -15}
Played 52 games: {'best_model_qnzjl': 16, 'base_jfyww': -16}
Played 53 games: {'best_model_qnzjl': 17, 'base_jfyww': -17}
Played 54 games: {'best_model_qnzjl': 16, 'base_jfyww': -16}
Played 55 games: {'best_model_qnzjl': 17, 'base_jfyww': -17}
Played 56 games: {'best_model_qnzjl': 16, 'base_jfyww': -16}
Played 57 games: {'best_model_qnzjl': 17, 'base_jfyww': -17}
Played 58 games: {'best_model_qnzjl': 16, 'base_jfyww': -16}
Played 59 games: {'best_model_qnzjl': 15, 'base_jfyww': -15}
Played 60 games: {'best_model_qnzjl': 16, 'base_jfyww': -16}
Played 61 games: {'best_model_qnzjl': 15, 'base_jfyww': -15}
Played 62 games: {'best_model_qnzjl': 16, 'base_jfyww': -16}
Played 63 games: {'best_model_qnzjl': 17, 'base_jfyww': -17}
Played 64 games: {'best_model_qnzjl': 18, 'base_jfyww': -18}
Played 65 games: {'best_model_qnzjl': 19, 'base_jfyww': -19}
Played 66 games: {'best_model_qnzjl': 18, 'base_jfyww': -18}
Played 67 games: {'best_model_qnzjl': 19, 'base_jfyww': -19}
Played 68 games: {'best_model_qnzjl': 20, 'base_jfyww': -20}
Played 69 games: {'best_model_qnzjl': 19, 'base_jfyww': -19}
Played 70 games: {'best_model_qnzjl': 20, 'base_jfyww': -20}
Played 71 games: {'best_model_qnzjl': 21, 'base_jfyww': -21}
Played 72 games: {'best_model_qnzjl': 22, 'base_jfyww': -22}
Played 73 games: {'best_model_qnzjl': 21, 'base_jfyww': -21}
Played 74 games: {'best_model_qnzjl': 22, 'base_jfyww': -22}
Played 75 games: {'best_model_qnzjl': 23, 'base_jfyww': -23}
Played 76 games: {'best_model_qnzjl': 24, 'base_jfyww': -24}
Played 77 games: {'best_model_qnzjl': 23, 'base_jfyww': -23}
Played 78 games: {'best_model_qnzjl': 24, 'base_jfyww': -24}
Played 79 games: {'best_model_qnzjl': 23, 'base_jfyww': -23}
Played 80 games: {'best_model_qnzjl': 22, 'base_jfyww': -22}
Played 81 games: {'best_model_qnzjl': 23, 'base_jfyww': -23}
Played 82 games: {'best_model_qnzjl': 24, 'base_jfyww': -24}
Played 83 games: {'best_model_qnzjl': 25, 'base_jfyww': -25}
Played 84 games: {'best_model_qnzjl': 26, 'base_jfyww': -26}
Played 85 games: {'best_model_qnzjl': 27, 'base_jfyww': -27}
Played 86 games: {'best_model_qnzjl': 26, 'base_jfyww': -26}
Played 87 games: {'best_model_qnzjl': 27, 'base_jfyww': -27}
Played 88 games: {'best_model_qnzjl': 26, 'base_jfyww': -26}
Played 89 games: {'best_model_qnzjl': 27, 'base_jfyww': -27}
Played 90 games: {'best_model_qnzjl': 28, 'base_jfyww': -28}
Played 91 games: {'best_model_qnzjl': 27, 'base_jfyww': -27}
Played 92 games: {'best_model_qnzjl': 28, 'base_jfyww': -28}
Played 93 games: {'best_model_qnzjl': 29, 'base_jfyww': -29}
Played 94 games: {'best_model_qnzjl': 30, 'base_jfyww': -30}
Played 95 games: {'best_model_qnzjl': 31, 'base_jfyww': -31}
Played 96 games: {'best_model_qnzjl': 32, 'base_jfyww': -32}
Played 97 games: {'best_model_qnzjl': 31, 'base_jfyww': -31}
Played 98 games: {'best_model_qnzjl': 32, 'base_jfyww': -32}
Played 99 games: {'best_model_qnzjl': 33, 'base_jfyww': -33}
Played 100 games: {'best_model_qnzjl': 34, 'base_jfyww': -34}
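In the early games the base agent leads: it wins the first four games, and best_model does not take a lasting lead until game 23. This could reflect better initial strategies in the base agent, weaknesses in the best_model policy, or simply noise in a short sequence of games.
In later games, as best_model adjusts (or the base strategy stops working), best_model pulls ahead. From game 23 onward it never trails, and its lead grows to +34 after 100 games, which corresponds to 67 wins and 33 losses with no draws.
These results could mean that best_model adapts better to the environment, that it has a higher skill ceiling, or that the base agent relies on static strategies that become predictable.
In the game of Connect4, the ability to plan ahead is key. best_model does not win every later game (it goes 32-18 over games 1-50 and 35-15 over games 51-100), but its consistent edge could indicate superior long-term planning, while the base agent may rely more on shallow strategies.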
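The cumulative scores in the game log can be turned into win and loss counts with a few lines of Python. This is a sketch only: the helper names are hypothetical, and it assumes each game adds +1 for a win, -1 for a loss, and 0 for a draw, which is what the log appears to show but is not stated anywhere. The last helper gives the probability of winning at least that many games if both agents were equally strong.

```python
import ast
import re
from math import comb

LINE_RE = re.compile(r"Played (\d+) games: (\{.*\})")

def parse_scores(log_text, name):
    """Return the cumulative score of agent `name` after each logged game."""
    scores = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            scores.append(ast.literal_eval(match.group(2))[name])
    return scores

def win_loss_draw(scores):
    """Convert cumulative scores to (wins, losses, draws), assuming +1/-1/0 per game."""
    wins = losses = draws = 0
    previous = 0
    for score in scores:
        delta = score - previous
        if delta > 0:
            wins += 1
        elif delta < 0:
            losses += 1
        else:
            draws += 1
        previous = score
    return wins, losses, draws

def prob_at_least(wins, games):
    """P(X >= wins) for X ~ Binomial(games, 0.5), i.e. two equally strong agents."""
    return sum(comb(games, k) for k in range(wins, games + 1)) / 2 ** games

# First five lines of the log, used here as a small sample.
sample = """\
Played 1 games: {'best_model_qnzjl': -1, 'base_jfyww': 1}
Played 2 games: {'best_model_qnzjl': -2, 'base_jfyww': 2}
Played 3 games: {'best_model_qnzjl': -3, 'base_jfyww': 3}
Played 4 games: {'best_model_qnzjl': -4, 'base_jfyww': 4}
Played 5 games: {'best_model_qnzjl': -3, 'base_jfyww': 3}
"""
scores = parse_scores(sample, "best_model_qnzjl")
print(win_loss_draw(scores))   # (1, 4, 0)

# Full match: 67 wins out of 100 decisive games.
print(prob_at_least(67, 100))
```

Running `parse_scores` on the full log and slicing the result (for example games 1-50 versus 51-100) gives a quick check on whether the advantage of best_model is concentrated in the later games or spread across the whole match.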