CartPole-v1 Policy Gradient Reinforcement Learning Model

Model Description

This model is a Policy Gradient (REINFORCE) agent trained to solve the CartPole-v1 environment from OpenAI Gym. The agent learns to balance a pole on a cart by taking discrete actions (left or right) to maximize the cumulative reward.

Model Details

Model Architecture

  • Algorithm: REINFORCE (Monte Carlo Policy Gradient)
  • Neural Network: Simple feedforward network
    • Hidden layer size: 16 units
    • Activation function: ReLU (typical for policy networks)
    • Output layer: Softmax for action probabilities
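
A minimal sketch consistent with this description (the class name Policy and the act method are taken from the Usage section below; layer names and argument defaults are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class Policy(nn.Module):
    # 4 observations -> 16 hidden units (ReLU) -> softmax over 2 actions
    def __init__(self, state_size=4, hidden_size=16, action_size=2):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=1)

    def act(self, state):
        # Sample an action and return it with its log-probability for the policy update
        state = torch.from_numpy(state).float().unsqueeze(0)
        probs = self.forward(state)
        m = Categorical(probs)
        action = m.sample()
        return action.item(), m.log_prob(action)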

Training Configuration

  • Environment: CartPole-v1 (OpenAI Gym)
  • Training Episodes: 2,000
  • Max Steps per Episode: 1,000
  • Learning Rate: 0.01
  • Discount Factor (γ): 1.0 (no discounting)
  • Optimizer: Adam (PyTorch default)
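
In PyTorch this configuration amounts to roughly the following (variable names are illustrative):

import torch.optim as optim

policy = Policy()                                     # the network sketched above
optimizer = optim.Adam(policy.parameters(), lr=0.01)  # learning rate 0.01
n_episodes = 2000                                     # training episodes
max_t = 1000                                          # max steps per episode
gamma = 1.0                                           # discount factor (no discounting)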

Environment Details

CartPole-v1 is a classic control problem where:

  • Observation Space: 4-dimensional continuous space
    • Cart position: [-4.8, 4.8]
    • Cart velocity: [-∞, ∞]
    • Pole angle: [-0.418 rad, 0.418 rad]
    • Pole angular velocity: [-∞, ∞]
  • Action Space: 2 discrete actions (0: push left, 1: push right)
  • Reward: +1 for every step the pole remains upright
  • Episode Termination:
    • Pole angle exceeds ±12° (≈ 0.209 rad)
    • Cart position exceeds ±2.4
    • Episode length reaches 500 steps (the CartPole-v1 time limit)
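
These properties can be inspected directly from Gym (classic Gym API shown; Gymnasium's reset additionally returns an info dict):

import gym

env = gym.make('CartPole-v1')
print(env.observation_space)  # Box(4,) with the bounds listed above
print(env.action_space)       # Discrete(2): 0 = push left, 1 = push right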

Training Process

The model was trained using the REINFORCE algorithm with the following key features:

  1. Return Calculation: Monte Carlo returns computed in a single reverse pass (G_t = r_t + γ·G_{t+1}), avoiding recomputing each return from scratch
  2. Reward Standardization: Returns are normalized (zero mean, unit variance) for training stability
  3. Policy Loss: Negative log-probability weighted by standardized returns
  4. Gradient Update: Standard backpropagation with Adam optimizer

Key Implementation Details

  • Returns calculated in reverse chronological order, so the full sequence of returns is built in one O(T) pass
  • Numerical stability ensured by adding a small epsilon to the standard deviation before dividing
  • A deque is used so each return can be prepended in O(1)
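
A sketch of the update step combining the points above (the function name and argument layout are assumptions; rewards and log_probs are collected during one episode via policy.act):

from collections import deque
import torch

def reinforce_update(optimizer, rewards, log_probs, gamma=1.0, eps=1e-8):
    # Returns in reverse chronological order: G_t = r_t + gamma * G_{t+1}
    returns = deque()
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.appendleft(G)  # deque makes this prepend O(1)
    returns = torch.tensor(list(returns), dtype=torch.float32)
    # Standardize; eps keeps the division numerically stable when std is ~0
    returns = (returns - returns.mean()) / (returns.std() + eps)
    # Policy loss: negative log-probability weighted by the standardized return
    loss = torch.cat([-lp * G for lp, G in zip(log_probs, returns)]).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()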

Performance

The model is evaluated over 10 episodes after training. Expected performance:

  • Target: Consistently achieve scores close to 500 (maximum possible in CartPole-v1)
  • Success Criterion: Average score > 475 over evaluation episodes
  • Training Stability: 100-episode rolling average tracked during training
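
A sketch of such an evaluation loop (classic Gym step API assumed):

import numpy as np

def evaluate(policy, env, n_episodes=10, max_t=1000):
    scores = []
    for _ in range(n_episodes):
        state = env.reset()
        total = 0.0
        for _ in range(max_t):
            action, _ = policy.act(state)
            state, reward, done, _ = env.step(action)
            total += reward
            if done:
                break
        scores.append(total)
    return np.mean(scores)  # compare against the > 475 success criterion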

Usage

# Load the trained policy (requires the Policy class definition to be importable)
import gym
import torch

policy = torch.load('policy_model.pth')

# Use the policy to select actions (classic Gym API; Gymnasium's reset also returns an info dict)
env = gym.make('CartPole-v1')
state = env.reset()
action, log_prob = policy.act(state)
state, reward, done, info = env.step(action)

Limitations and Considerations

  1. Environment Specific: The model is trained specifically for CartPole-v1 and will not generalize to other environments
  2. Sample Efficiency: REINFORCE can be sample-inefficient compared to modern policy gradient methods
  3. Variance: Policy gradient estimates have high variance because no baseline or critic is used
  4. Hyperparameter Sensitivity: Performance may be sensitive to the learning rate and network architecture

Ethical Considerations

This is a simple control task with no direct ethical implications. The model is intended for:

  • Educational purposes in reinforcement learning
  • Benchmarking and algorithm development
  • Research in policy gradient methods

Training Environment

  • Framework: PyTorch
  • Environment: OpenAI Gym
  • Monitoring: 100-episode rolling average for performance tracking
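
The rolling average can be maintained with a fixed-size deque (a common pattern; names are illustrative):

from collections import deque
import numpy as np

scores_window = deque(maxlen=100)  # keeps only the last 100 episode scores
for episode_score in [200.0, 350.0, 500.0]:  # stand-in for real per-episode scores
    scores_window.append(episode_score)
print(f'100-episode rolling average: {np.mean(scores_window):.2f}')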

Model Files

  • policy_model.pth: Trained policy network weights
  • training_scores.pkl: Training episode scores for analysis
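
These files can be produced at the end of training roughly as follows (saving the whole module with torch.save matches the torch.load call in the Usage section):

import pickle
import torch

torch.save(policy, 'policy_model.pth')  # full policy module, weights included
with open('training_scores.pkl', 'wb') as f:
    pickle.dump(scores, f)  # per-episode training scores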

Citation

If you use this model, please cite:

@misc{cartpole-policy-gradient-2024,
  title={CartPole-v1 Policy Gradient Reinforcement Learning Model},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face Hub},
  url={https://huggingface.co/Adilbai/CartPole-v1-policy-gradient-RL}
}

References

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
  • Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229-256.
  • OpenAI Gym CartPole-v1 Environment Documentation

For questions or issues with this model, please open an issue in the repository.
