CartPole-v1 Policy Gradient Reinforcement Learning Model
Model Description
This model is a Policy Gradient (REINFORCE) agent trained to solve the CartPole-v1 environment from OpenAI Gym. The agent learns to balance a pole on a cart by taking discrete actions (left or right) to maximize the cumulative reward.
Model Details
Model Architecture
- Algorithm: REINFORCE (Monte Carlo Policy Gradient)
- Neural Network: Simple feedforward network
  - Hidden layer size: 16 units
  - Activation function: ReLU (typical for policy networks)
  - Output layer: Softmax producing action probabilities (see the sketch below)
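For reference, here is a minimal PyTorch sketch of a policy with this shape. The class and method names are illustrative and may differ from the original training script; the `act` method matches the usage shown later in this card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class Policy(nn.Module):
    """Feedforward policy: 4-dim state -> 16 hidden units -> 2 action probabilities."""
    def __init__(self, state_size=4, action_size=2, hidden_size=16):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=-1)  # action probabilities

    def act(self, state):
        """Sample an action and return it together with its log-probability."""
        state = torch.from_numpy(state).float().unsqueeze(0)
        probs = self.forward(state)
        dist = Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action)
```

The `act` method samples from the categorical distribution defined by the softmax output, which provides the log-probability terms that REINFORCE needs.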
Training Configuration
- Environment: CartPole-v1 (OpenAI Gym)
- Training Episodes: 2,000
- Max Steps per Episode: 1,000
- Learning Rate: 0.01
- Discount Factor (γ): 1.0 (no discounting)
- Optimizer: Adam (PyTorch implementation, default settings apart from the learning rate); see the setup sketch below
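A sketch of how this configuration might be wired up (variable names are illustrative; `Policy` refers to the sketch above):

```python
import gym
import torch.optim as optim

env = gym.make('CartPole-v1')

n_training_episodes = 2000   # Training episodes
max_t = 1000                 # Max steps per episode
gamma = 1.0                  # Discount factor (no discounting)
learning_rate = 1e-2         # Learning rate

policy = Policy(state_size=4, action_size=2, hidden_size=16)
optimizer = optim.Adam(policy.parameters(), lr=learning_rate)
```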
Environment Details
CartPole-v1 is a classic control problem where:
- Observation Space: 4-dimensional continuous space
  - Cart position: [-4.8, 4.8]
  - Cart velocity: (-∞, ∞)
  - Pole angle: [-0.418 rad, 0.418 rad]
  - Pole angular velocity: (-∞, ∞)
- Action Space: 2 discrete actions (0: push left, 1: push right)
- Reward: +1 for every step the pole remains upright
- Episode Termination:
  - Pole angle exceeds ±12°
  - Cart position exceeds ±2.4
  - Episode length reaches 500 steps (CartPole-v1 time limit)
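These specifications can be confirmed directly from the environment. A sketch using the classic OpenAI Gym API (newer Gym/Gymnasium versions return `(obs, info)` from `reset` and a 5-tuple from `step`):

```python
import gym

env = gym.make('CartPole-v1')
print(env.observation_space)        # Box with the 4-dimensional bounds listed above
print(env.action_space)             # Discrete(2)
print(env.spec.max_episode_steps)   # 500, the CartPole-v1 time limit

state = env.reset()                                                    # 4-dimensional observation
next_state, reward, done, info = env.step(env.action_space.sample())  # reward is +1 per step
```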
Training Process
The model was trained using the REINFORCE algorithm with the following key features:
- Return Calculation: Monte Carlo returns computed using dynamic programming for efficiency
- Reward Standardization: Returns are normalized (zero mean, unit variance) for training stability
- Policy Loss: Negative log-probability weighted by standardized returns
- Gradient Update: Standard backpropagation with Adam optimizer
Key Implementation Details
- Returns are computed in reverse chronological order (a single backward pass) for computational efficiency
- Numerical stability is ensured by adding a small epsilon to the standard deviation before normalizing
- A `deque` is used so returns can be prepended in O(1) time (see the sketch below)
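Putting the training process and these implementation details together, one training episode might look like the following sketch (function and variable names are illustrative, not necessarily those of the original script):

```python
from collections import deque
import torch

def reinforce_episode(policy, optimizer, env, max_t=1000, gamma=1.0, eps=1e-8):
    saved_log_probs, rewards = [], []
    state = env.reset()
    for t in range(max_t):
        action, log_prob = policy.act(state)
        saved_log_probs.append(log_prob)
        state, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            break

    # Returns computed in reverse: G_t = r_t + gamma * G_{t+1};
    # deque.appendleft keeps them in chronological order with O(1) inserts.
    returns = deque(maxlen=max_t)
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.appendleft(G)

    # Standardize returns (zero mean, unit variance); eps guards against division by zero.
    returns = torch.tensor(list(returns), dtype=torch.float32)
    returns = (returns - returns.mean()) / (returns.std() + eps)

    # Policy loss: negative log-probability weighted by the standardized return.
    loss = torch.cat([-lp * G for lp, G in zip(saved_log_probs, returns)]).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```

Subtracting the mean acts as a simple baseline and dividing by the standard deviation rescales the gradient; both are common variance-reduction heuristics that improve training stability.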
Performance
The model is evaluated over 10 episodes after training. Expected performance:
- Target: Consistently achieve scores close to 500 (maximum possible in CartPole-v1)
- Success Criterion: Average score > 475 over evaluation episodes
- Training Stability: 100-episode rolling average tracked during training
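A minimal evaluation loop consistent with this protocol might look like this (a sketch; names are illustrative):

```python
import gym
import numpy as np

def evaluate(policy, env, n_episodes=10, max_t=1000):
    scores = []
    for _ in range(n_episodes):
        state = env.reset()
        total_reward = 0.0
        for _ in range(max_t):
            action, _ = policy.act(state)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
        scores.append(total_reward)
    return np.mean(scores), np.std(scores)

mean_reward, std_reward = evaluate(policy, gym.make('CartPole-v1'))
print(f"mean_reward: {mean_reward:.2f} +/- {std_reward:.2f}")  # success if mean > 475
```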
Usage
```python
import gym
import torch

# Load the trained policy (loads the full nn.Module saved with torch.save)
policy = torch.load('policy_model.pth')

# Use the policy to select actions
env = gym.make('CartPole-v1')
state = env.reset()
action, log_prob = policy.act(state)
```
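Note that `torch.load` on a full module object requires the `Policy` class definition to be importable; if only a `state_dict` was saved instead, instantiate `Policy` first and call `policy.load_state_dict(torch.load('policy_model.pth'))`.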
Limitations and Considerations
- Environment Specific: Model is specifically trained for CartPole-v1 and won't generalize to other environments
- Sample Efficiency: REINFORCE can be sample inefficient compared to modern policy gradient methods
- Variance: High variance in policy gradient estimates (not using baseline/critic)
- Hyperparameter Sensitivity: Performance may be sensitive to learning rate and network architecture
Ethical Considerations
This is a simple control task with no ethical implications. The model is designed for:
- Educational purposes in reinforcement learning
- Benchmarking and algorithm development
- Research in policy gradient methods
Training Environment
- Framework: PyTorch
- Environment: OpenAI Gym
- Monitoring: 100-episode rolling average for performance tracking
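The 100-episode rolling average can be tracked with a fixed-length deque, for example (a sketch; `reinforce_episode` refers to the training-step sketch above):

```python
from collections import deque
import numpy as np

scores_window = deque(maxlen=100)   # keeps only the last 100 episode scores

for episode in range(1, 2001):
    score = reinforce_episode(policy, optimizer, env)  # one training episode
    scores_window.append(score)
    if episode % 100 == 0:
        print(f"Episode {episode}\tAverage score (last 100): {np.mean(scores_window):.2f}")
```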
Model Files
- policy_model.pth: Trained policy network weights
- training_scores.pkl: Training episode scores for analysis
Citation
If you use this model, please cite:
```bibtex
@misc{cartpole-policy-gradient-2024,
  title={CartPole-v1 Policy Gradient Reinforcement Learning Model},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face Hub},
  url={https://huggingface.co/Adilbai/CartPole-v1-policy-gradient-RL}
}
```
References
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4), 229-256.
- OpenAI Gym CartPole-v1 Environment Documentation
For questions or issues with this model, please open an issue in the repository.
Evaluation results
- mean_reward on CartPole-v1 (self-reported): 500.00 +/- 0.00