# SIMPLE Game Project - Terminology
This page lists the important terms and definitions used in this project, to help clarify concepts and reduce confusion.
## Key Terms
Term | Definition |
---|---|
Reinforcement Learning | A type of machine learning where agents learn by interacting with an environment and receiving feedback in the form of rewards or penalties. |
Multiplayer Game | A game where multiple players interact and compete against each other, often requiring strategies that adapt to opponents' actions. |
SIMPLE | An acronym for Self-play In MultiPlayer Environments, a framework that trains AI by pitting versions of itself against one another. |
Agent | An AI or program that interacts with an environment to make decisions and perform actions. |
Self-Play | A training mechanism where an AI agent competes against itself or past versions to improve and learn strategies. |
Policy | The strategy or rule set that an agent follows when making decisions in a given state within the environment. |
Environment | The virtual world or game scenario where the agent interacts, performs actions, and receives feedback. |
State | A representation of the current situation or configuration in the environment that the agent can perceive. |
Action | A move or decision taken by the agent to interact with the environment and progress toward a goal. |
Reward | Feedback given to an agent after an action, used to reinforce learning by encouraging positive outcomes. |
Training Epoch | A full pass over the collected training data (in reinforcement learning, typically a batch of gameplay experience) used to refine the agent's model. |
Model | A mathematical framework or neural network that an agent uses to process data, predict outcomes, and make decisions. |
Opponent Modeling | A technique where an AI learns or predicts the strategies and actions of an opponent to adapt and compete effectively. |
Exploration vs Exploitation | A trade-off in AI training where the agent must balance trying new actions (exploration) and using known successful strategies (exploitation). |
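
To make terms such as environment, state, action, reward, and policy concrete, here is a minimal sketch of an agent-environment interaction loop in the Gym style. The toy environment and the random policy are illustrative placeholders, not part of the SIMPLE codebase.

```python
import random

class CoinFlipEnv:
    """A toy two-action environment used only to illustrate the terms above."""
    def reset(self):
        self.steps = 0
        return 0  # initial state

    def step(self, action):
        # The agent is rewarded when its action matches a random coin flip.
        self.steps += 1
        reward = 1 if action == random.randint(0, 1) else -1
        done = self.steps >= 10          # episode ends after 10 steps
        next_state = self.steps          # state = number of steps taken so far
        return next_state, reward, done

def random_policy(state):
    """A policy maps a state to an action; this one simply acts at random."""
    return random.randint(0, 1)

env = CoinFlipEnv()
state = env.reset()
done = False
total_reward = 0
while not done:
    action = random_policy(state)            # agent chooses an action
    state, reward, done = env.step(action)   # environment returns feedback
    total_reward += reward                   # reward is the learning signal
print("episode reward:", total_reward)
```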
## Optimization Metrics
Metric | Meaning |
---|---|
Pol_surr (Policy Surrogate Loss) | The PPO clipped surrogate policy loss. It measures how the expected advantage changes under the updated policy, with the probability ratio constrained by PPO's clipping mechanism. Negative values indicate the policy is improving (assigning higher probability to better actions). |
Pol_entpen (Policy Entropy Penalty) | Represents the entropy term encouraging exploration. Higher entropy means the policy explores more; as training progresses, entropy should gradually decrease as the policy converges. |
Vf_loss (Value Function Loss) | Measures the error in the value function approximation. Lower values indicate the agent is better at predicting expected returns. |
Kl (Kullback-Leibler Divergence) | Measures the change in the policy before and after the update. A small KL divergence suggests the policy update is within acceptable bounds. |
Ent (Entropy) | Reflects the randomness of the policy. High entropy indicates more exploration; as training progresses, entropy decreases as the policy stabilizes. |
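
As an illustration of how these quantities relate, the sketch below computes a clipped surrogate loss, a rough KL estimate, and an entropy value for a made-up batch using NumPy. The array values and the clip range are assumptions chosen for illustration; this is not the project's actual training code.

```python
import numpy as np

clip_range = 0.2  # assumed PPO clipping parameter

# Toy batch: log-probabilities of the taken actions under the old and new
# policy, plus an advantage estimate for each sample.
old_log_prob = np.array([-0.9, -1.2, -0.5, -1.0])
new_log_prob = np.array([-0.8, -1.3, -0.4, -1.1])
advantages   = np.array([ 0.5, -0.2,  1.0, -0.4])

ratio = np.exp(new_log_prob - old_log_prob)             # probability ratio new/old
unclipped = ratio * advantages
clipped = np.clip(ratio, 1 - clip_range, 1 + clip_range) * advantages
pol_surr = -np.mean(np.minimum(unclipped, clipped))     # clipped surrogate loss

approx_kl = np.mean(old_log_prob - new_log_prob)        # rough KL estimate

# Entropy of an action distribution (a single 3-action distribution here).
action_probs = np.array([0.2, 0.5, 0.3])
entropy = -np.sum(action_probs * np.log(action_probs))

print("pol_surr:", pol_surr, "approx_kl:", approx_kl, "entropy:", entropy)
```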
## Evaluation Results
Metric | Meaning |
---|---|
EpLenMean (Episode Length Mean) | Average number of steps per episode. If this stabilizes, it might indicate that the agent is learning an optimal strategy. |
EpRewMean (Episode Reward Mean) | Average reward per episode. This is a primary measure of learning progress, where an increase implies the agent is performing better in the environment. |
EpThisIter (Episodes This Iteration) | The number of episodes completed in the current training iteration. |
EpisodesSoFar | Total episodes completed so far. |
TimestepsSoFar | Total timesteps processed. |
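
EpLenMean, EpRewMean, and EpThisIter are simple aggregates over the episodes collected in an iteration. A short sketch, using invented episode records:

```python
# Hypothetical episode records from one training iteration: (length, total reward).
episodes = [(34, 1.0), (41, -1.0), (29, 1.0), (37, 0.0)]

ep_len_mean = sum(length for length, _ in episodes) / len(episodes)   # EpLenMean
ep_rew_mean = sum(reward for _, reward in episodes) / len(episodes)   # EpRewMean
ep_this_iter = len(episodes)                                          # EpThisIter

print(ep_len_mean, ep_rew_mean, ep_this_iter)
```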
## Loss Metrics
Metric | Meaning |
---|---|
Ev_tdlam_before | The explained variance of the value function's predictions against the TD(lambda) returns, measured before the update. Values close to 1 indicate good alignment between predicted values and actual returns. |
Loss_ent | Entropy of the policy (should decrease over time as policy converges). |
Loss_kl | KL divergence for the policy update. |
Loss_pol_entpen | Entropy penalty term for the policy. |
Loss_vf_loss | Loss related to the value function approximation. |
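
The sketch below shows how an explained-variance score such as Ev_tdlam_before can be computed from value predictions and TD(lambda) returns; the arrays are invented for illustration.

```python
import numpy as np

# Hypothetical value predictions and TD(lambda) returns for one batch.
value_predictions = np.array([0.9, 0.4, -0.2, 1.1])
tdlam_returns     = np.array([1.0, 0.5, -0.1, 1.0])

# Explained variance: 1 when predictions match returns exactly,
# 0 when they explain nothing, negative when worse than a constant predictor.
explained_variance = 1 - np.var(tdlam_returns - value_predictions) / np.var(tdlam_returns)
print(explained_variance)
```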
## Meaning of test.py Results
Item | Details |
---|---|
Most Recent Model | The first model listed is the most recently trained model, while the second is the previous model it plays against. |
Cumulative Scores | Each line shows the cumulative scores after a certain number of games. Positive scores indicate success or winning, while negative scores indicate losses. |
Winning Score | A win adds 1 to the agent’s total score. |
Losing Score | A loss subtracts 1 from the agent’s total score. |
Draw | A draw results in no change. |
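
The scoring rule can be summarised in a few lines of Python; the game outcomes below are invented for illustration.

```python
# +1 for a win, -1 for a loss, 0 for a draw, accumulated over games.
results = ["win", "loss", "win", "draw", "win"]   # hypothetical game outcomes
score_for = {"win": 1, "loss": -1, "draw": 0}

cumulative = 0
for game_number, result in enumerate(results, start=1):
    cumulative += score_for[result]
    print(f"after game {game_number}: cumulative score = {cumulative}")
```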