CartPole-v1 Policy Gradient Reinforcement Learning Model
Model Description
This model is a Policy Gradient (REINFORCE) agent trained to solve the CartPole-v1 environment from OpenAI Gym. The agent learns to balance a pole on a cart by taking discrete actions (left or right) to maximize the cumulative reward.
Model Details
Model Architecture
- Algorithm: REINFORCE (Monte Carlo Policy Gradient)
- Neural Network: Simple feedforward network
  - Hidden layer size: 16 units
  - Activation function: ReLU (typical for policy networks)
  - Output layer: Softmax producing action probabilities (see the sketch below)
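For reference, here is a minimal PyTorch sketch of a policy with this shape. The class and method names are illustrative and may differ from the original training script; the `act` method matches the usage shown later in this card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class Policy(nn.Module):
    """Feedforward policy: 4-dim state -> 16 hidden units -> 2 action probabilities."""
    def __init__(self, state_size=4, action_size=2, hidden_size=16):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=-1)  # action probabilities

    def act(self, state):
        """Sample an action and return it together with its log-probability."""
        state = torch.from_numpy(state).float().unsqueeze(0)
        probs = self.forward(state)
        dist = Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action)
```

The `act` method samples from the categorical distribution defined by the softmax output, which provides the log-probability terms that REINFORCE needs.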
Training Configuration
- Environment: CartPole-v1 (OpenAI Gym)
- Training Episodes: 2,000
- Max Steps per Episode: 1,000
- Learning Rate: 0.01
- Discount Factor (γ): 1.0 (no discounting)
- Optimizer: Adam (PyTorch implementation, default settings apart from the learning rate); see the setup sketch below
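A sketch of how this configuration might be wired up (variable names are illustrative; `Policy` refers to the sketch above):

```python
import gym
import torch.optim as optim

env = gym.make('CartPole-v1')

n_training_episodes = 2000   # Training episodes
max_t = 1000                 # Max steps per episode
gamma = 1.0                  # Discount factor (no discounting)
learning_rate = 1e-2         # Learning rate

policy = Policy(state_size=4, action_size=2, hidden_size=16)
optimizer = optim.Adam(policy.parameters(), lr=learning_rate)
```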
Environment Details
CartPole-v1 is a classic control problem where:
- Observation Space: 4-dimensional continuous space
  - Cart position: [-4.8, 4.8]
  - Cart velocity: (-∞, ∞)
  - Pole angle: [-0.418 rad, 0.418 rad]
  - Pole angular velocity: (-∞, ∞)
- Action Space: 2 discrete actions (0: push left, 1: push right)
- Reward: +1 for every step the pole remains upright
- Episode Termination:
  - Pole angle exceeds ±12°
  - Cart position exceeds ±2.4
  - Episode length reaches 500 steps (CartPole-v1 time limit)
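These specifications can be confirmed directly from the environment. A sketch using the classic OpenAI Gym API (newer Gym/Gymnasium versions return `(obs, info)` from `reset` and a 5-tuple from `step`):

```python
import gym

env = gym.make('CartPole-v1')
print(env.observation_space)        # Box with the 4-dimensional bounds listed above
print(env.action_space)             # Discrete(2)
print(env.spec.max_episode_steps)   # 500, the CartPole-v1 time limit

state = env.reset()                                                    # 4-dimensional observation
next_state, reward, done, info = env.step(env.action_space.sample())  # reward is +1 per step
```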
Training Process
The model was trained using the REINFORCE algorithm with the following key features:
- Return Calculation: Monte Carlo returns computed using dynamic programming for efficiency
- Reward Standardization: Returns are normalized (zero mean, unit variance) for training stability
- Policy Loss: Negative log-probability weighted by standardized returns
- Gradient Update: Standard backpropagation with Adam optimizer
Key Implementation Details
- Returns are computed in reverse chronological order (a single backward pass) for computational efficiency
- Numerical stability is ensured by adding a small epsilon to the standard deviation before normalizing
- A `deque` is used so returns can be prepended in O(1) time (see the sketch below)
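Putting the training process and these implementation details together, one training episode might look like the following sketch (function and variable names are illustrative, not necessarily those of the original script):

```python
from collections import deque
import torch

def reinforce_episode(policy, optimizer, env, max_t=1000, gamma=1.0, eps=1e-8):
    saved_log_probs, rewards = [], []
    state = env.reset()
    for t in range(max_t):
        action, log_prob = policy.act(state)
        saved_log_probs.append(log_prob)
        state, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            break

    # Returns computed in reverse: G_t = r_t + gamma * G_{t+1};
    # deque.appendleft keeps them in chronological order with O(1) inserts.
    returns = deque(maxlen=max_t)
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.appendleft(G)

    # Standardize returns (zero mean, unit variance); eps guards against division by zero.
    returns = torch.tensor(list(returns), dtype=torch.float32)
    returns = (returns - returns.mean()) / (returns.std() + eps)

    # Policy loss: negative log-probability weighted by the standardized return.
    loss = torch.cat([-lp * G for lp, G in zip(saved_log_probs, returns)]).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```

Subtracting the mean acts as a simple baseline and dividing by the standard deviation rescales the gradient; both are common variance-reduction heuristics that improve training stability.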
Performance
The model is evaluated over 10 episodes after training. Expected performance:
- Target: Consistently achieve scores close to 500 (maximum possible in CartPole-v1)
- Success Criterion: Average score > 475 over evaluation episodes
- Training Stability: 100-episode rolling average tracked during training
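A minimal evaluation loop consistent with this protocol might look like this (a sketch; names are illustrative):

```python
import gym
import numpy as np

def evaluate(policy, env, n_episodes=10, max_t=1000):
    scores = []
    for _ in range(n_episodes):
        state = env.reset()
        total_reward = 0.0
        for _ in range(max_t):
            action, _ = policy.act(state)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
        scores.append(total_reward)
    return np.mean(scores), np.std(scores)

mean_reward, std_reward = evaluate(policy, gym.make('CartPole-v1'))
print(f"mean_reward: {mean_reward:.2f} +/- {std_reward:.2f}")  # success if mean > 475
```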
Usage
```python
import gym
import torch

# Load the trained policy (loads the full nn.Module saved with torch.save)
policy = torch.load('policy_model.pth')

# Use the policy to select actions
env = gym.make('CartPole-v1')
state = env.reset()
action, log_prob = policy.act(state)
```
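Note that `torch.load` on a full module object requires the `Policy` class definition to be importable; if only a `state_dict` was saved instead, instantiate `Policy` first and call `policy.load_state_dict(torch.load('policy_model.pth'))`.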
Limitations and Considerations
- Environment Specific: Model is specifically trained for CartPole-v1 and won't generalize to other environments
- Sample Efficiency: REINFORCE can be sample inefficient compared to modern policy gradient methods
- Variance: High variance in policy gradient estimates (not using baseline/critic)
- Hyperparameter Sensitivity: Performance may be sensitive to learning rate and network architecture
Ethical Considerations
This is a simple control task with no ethical implications. The model is designed for:
- Educational purposes in reinforcement learning
- Benchmarking and algorithm development
- Research in policy gradient methods
Training Environment
- Framework: PyTorch
- Environment: OpenAI Gym
- Monitoring: 100-episode rolling average for performance tracking
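The 100-episode rolling average can be tracked with a fixed-length deque, for example (a sketch; `reinforce_episode` refers to the training-step sketch above):

```python
from collections import deque
import numpy as np

scores_window = deque(maxlen=100)   # keeps only the last 100 episode scores

for episode in range(1, 2001):
    score = reinforce_episode(policy, optimizer, env)  # one training episode
    scores_window.append(score)
    if episode % 100 == 0:
        print(f"Episode {episode}\tAverage score (last 100): {np.mean(scores_window):.2f}")
```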
Model Files
- policy_model.pth: Trained policy network weights
- training_scores.pkl: Training episode scores for analysis
Citation
If you use this model, please cite:
```bibtex
@misc{cartpole-policy-gradient-2024,
  title={CartPole-v1 Policy Gradient Reinforcement Learning Model},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face Hub},
  url={https://huggingface.co/Adilbai/CartPole-v1-policy-gradient-RL}
}
```
References
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4), 229-256.
- OpenAI Gym CartPole-v1 Environment Documentation
For questions or issues with this model, please open an issue in the repository.
Evaluation results
- mean_reward on CartPole-v1 (self-reported): 500.00 +/- 0.00