
Approx_kl: approximate KL divergence, measuring the difference between the old and new policies. In PPO this is an important diagnostic of how far the policy moved in a single update, and some implementations stop the update early once it exceeds a target threshold. The value here is very small (0.0018954609), meaning the old and new policies are very close, which is usually desirable because it keeps training stable.
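As a sketch, a low-variance estimator commonly used for this metric can be written as follows; old_log_prob and new_log_prob are placeholder names for the action log-probabilities stored during rollout and recomputed under the current policy:

```python
import torch

# Placeholder log-probabilities of the same actions under the old and new policies;
# in practice they come from the rollout buffer and the current policy network.
old_log_prob = torch.randn(64)
new_log_prob = old_log_prob + 0.01 * torch.randn(64)

log_ratio = new_log_prob - old_log_prob
ratio = torch.exp(log_ratio)

# Low-variance KL estimator: E[(ratio - 1) - log(ratio)] is always >= 0
approx_kl = ((ratio - 1) - log_ratio).mean()
print(approx_kl.item())
```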
Clip_fraction: clipping fraction, the share of samples whose probability ratio between the new and old policies fell outside the clipping range and was therefore clipped by the surrogate objective. 0.0511 here means only about 5% of samples were clipped, which usually indicates that the policy update is relatively smooth.
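A minimal sketch of how this fraction can be computed, assuming a clip_range of 0.2 and a placeholder ratio tensor:

```python
import torch

clip_range = 0.2
# Placeholder probability ratios new_prob / old_prob for a batch of samples
ratio = torch.exp(0.1 * torch.randn(64))

# Fraction of samples whose ratio left the interval [1 - clip_range, 1 + clip_range]
clip_fraction = (torch.abs(ratio - 1.0) > clip_range).float().mean()
print(clip_fraction.item())
```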
Clip_range: clipping range, the threshold that limits how far the probability ratio between the new and old policies may move away from 1 in the surrogate objective. In PPO this prevents the policy from being updated too aggressively. 0.2 here means the ratio is clipped to the interval [0.8, 1.2].
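A sketch of how the clipping range constrains the ratio (the full clipped surrogate loss appears below under Policy_gradient_loss):

```python
import torch

clip_range = 0.2
ratio = torch.exp(0.1 * torch.randn(64))  # placeholder probability ratios

# The ratio is constrained to [1 - clip_range, 1 + clip_range] before entering the objective
clipped_ratio = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range)
```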
Entropy_loss: entropy loss, which encourages the policy to output more diverse actions and avoids premature convergence to a local optimum. It is typically defined as the negative mean entropy of the action distribution, so a negative logged value (-0.389 here) is expected: minimizing this term pushes entropy up and keeps exploration alive.
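A sketch of one common way to compute this term, assuming a discrete action distribution; the logits are placeholders:

```python
import torch

# Placeholder action logits for a batch of 64 states and 4 discrete actions
dist = torch.distributions.Categorical(logits=torch.randn(64, 4))

# Negative mean entropy: minimizing this loss increases entropy (more exploration)
entropy_loss = -dist.entropy().mean()
print(entropy_loss.item())  # negative, since entropy itself is positive
```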
Explained_variance: explained variance, commonly used to evaluate how well the value function predicts the empirical returns. A value of 1 means perfect prediction, 0 means the predictions are no better than always predicting the mean return, and a negative value means they are worse than that baseline, or that there is a problem with the implementation. The -0.754 here suggests the value function is fitting poorly and may need further investigation.
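The standard definition, sketched with NumPy (y_pred are the value predictions, y_true the empirical returns; both are placeholder names):

```python
import numpy as np

def explained_variance(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """1 - Var(y_true - y_pred) / Var(y_true); 1 is perfect, <= 0 is poor."""
    var_y = np.var(y_true)
    return np.nan if var_y == 0 else 1.0 - np.var(y_true - y_pred) / var_y
```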
Learning_rate: learning rate, which controls the step size of model parameter updates. 0.0003 here is a relatively small value, meaning parameters are updated cautiously.
Loss: the value of the combined loss function that the algorithm optimizes, typically the policy gradient loss plus weighted entropy and value losses. The -0.0159 here is that combined value; it can legitimately be negative because the policy and entropy terms are often negative.
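A sketch of how the logged value is usually assembled; the component losses here are generic placeholders, and the coefficient names ent_coef and vf_coef are assumptions:

```python
import torch

# Placeholder component losses (computed as in the other snippets in this section)
policy_gradient_loss = torch.tensor(-0.002)
entropy_loss = torch.tensor(-0.4)
value_loss = torch.tensor(0.001)

ent_coef = 0.01  # assumed weight of the entropy bonus
vf_coef = 0.5    # assumed weight of the value loss

# Combined loss minimized by the optimizer
loss = policy_gradient_loss + ent_coef * entropy_loss + vf_coef * value_loss
```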
N_updates: number of updates, i.e. how many times the model parameters have been updated during training; 480 here.
Policy_gradient_loss: policy gradient loss, the part of the overall loss that guides the optimization of the policy network (typically the negated clipped surrogate objective). The small negative value here (-0.00143) is normal; in PPO this term usually hovers close to zero.
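A sketch of the clipped surrogate form this loss usually takes; advantages, ratio, and clip_range are placeholders:

```python
import torch

clip_range = 0.2
advantages = torch.randn(64)                # placeholder advantage estimates
ratio = torch.exp(0.1 * torch.randn(64))    # placeholder new/old probability ratios

# Clipped surrogate objective, negated so that minimizing it improves the policy
surrogate_1 = advantages * ratio
surrogate_2 = advantages * torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range)
policy_gradient_loss = -torch.min(surrogate_1, surrogate_2).mean()
```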
Value_loss: value loss, the loss of the value function network (the network that predicts state values), usually the mean squared error between predicted values and empirical returns. The 3.47e-06 here is very small, which may mean the value function has been trained relatively well.
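A sketch of the usual mean-squared-error form, with placeholder tensors values (predictions) and returns (targets):

```python
import torch
import torch.nn.functional as F

values = torch.randn(64)                     # placeholder value predictions
returns = values + 0.001 * torch.randn(64)   # placeholder empirical returns

# Mean squared error between predicted state values and the observed returns
value_loss = F.mse_loss(values, returns)
```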