minecraft-rl-gathering

A Minecraft RL agent trained with PPO (Proximal Policy Optimization) using Stable-Baselines3.

This agent was trained to gather resources in Minecraft using a distributed training architecture: an RTX 5090 for PPO updates, a Jetson Orin AGX hosting the game environment, and a DGX Spark providing LLM reward shaping.

Training Details

| Metric        | Value     |
|---------------|-----------|
| Total Steps   | 230,920   |
| Episodes      | ~85       |
| Mean Reward   | -141.5    |
| Best Reward   | +50.7    |
| Reward Scheme | gathering |
| Learning Rate | 0.0003    |

Hardware

  • Training: NVIDIA RTX 5090 (32GB VRAM)
  • Environment: NVIDIA Jetson Orin AGX (64GB RAM)
  • LLM Server: NVIDIA DGX Spark - GPT-OSS-20B (vLLM)

Architecture

  • Algorithm: PPO (Proximal Policy Optimization)
  • Policy: MLP with [512, 512] hidden layers
  • Observation Space: 82 dimensions (position, velocity, vitals, hotbar, craftable flags)
  • Action Space: 37 discrete actions (movement, mining, crafting, inventory)

Observation Space (82 dimensions)

| Component       | Dimensions | Description                        |
|-----------------|------------|------------------------------------|
| Position        | 3          | x, y, z normalized                 |
| Velocity        | 3          | vx, vy, vz                         |
| Orientation     | 2          | yaw, pitch normalized              |
| Vitals          | 4          | health, food, saturation, oxygen   |
| Flags           | 2          | is_on_ground, is_day               |
| Time            | 1          | time_of_day normalized             |
| Hotbar          | 18         | 9 slots × (item_type + count)      |
| Held Item       | 3          | type, count, durability            |
| Craftable       | 8          | can_craft flags for key items      |
| Block Grid      | 27         | 3×3×3 nearby blocks                |
| Nearby Entities | 11         | closest entity info                |
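
These component sizes sum to 82 (3+3+2+4+2+1+18+3+8+27+11). As a minimal sketch, the flat observation space could be declared in Gymnasium like this; the component names and unbounded limits are illustrative assumptions, not the repo's actual code:

import numpy as np
from gymnasium import spaces

# Illustrative reconstruction of the 82-dim observation layout;
# component names and bounds are assumptions, not the repo's code.
OBS_COMPONENTS = {
    "position": 3, "velocity": 3, "orientation": 2, "vitals": 4,
    "flags": 2, "time": 1, "hotbar": 18, "held_item": 3,
    "craftable": 8, "block_grid": 27, "nearby_entities": 11,
}
OBS_DIM = sum(OBS_COMPONENTS.values())  # = 82

observation_space = spaces.Box(
    low=-np.inf, high=np.inf, shape=(OBS_DIM,), dtype=np.float32
)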

Action Space (37 discrete actions)

| Category        | Actions |
|-----------------|---------|
| Movement (0-7)  | forward, back, left, right, jump, jump_forward, sprint_forward, forward_long |
| Looking (8-11)  | look_left, look_right, look_up, look_down |
| Mining (12-16)  | mine, attack, mine_forward, mine_up, jump_mine_up |
| Hotbar (17-25)  | select_slot_0 through select_slot_8 |
| Items (26-28)   | place_block, eat_food, use_item |
| Crafting (29-36) | craft_planks, craft_sticks, craft_crafting_table, craft_wooden_pickaxe, craft_stone_pickaxe, craft_wooden_sword, craft_furnace, craft_torch |
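
The matching action space is a single Discrete(37). The index-to-name lookup below is an illustrative sketch built from the table above, not the repo's actual code:

from gymnasium import spaces

# 37 discrete actions in the order given by the table above; the
# flat name list is an illustrative sketch, not the repo's code.
ACTION_NAMES = (
    ["forward", "back", "left", "right", "jump",
     "jump_forward", "sprint_forward", "forward_long"]        # 0-7
    + ["look_left", "look_right", "look_up", "look_down"]     # 8-11
    + ["mine", "attack", "mine_forward", "mine_up",
       "jump_mine_up"]                                        # 12-16
    + [f"select_slot_{i}" for i in range(9)]                  # 17-25
    + ["place_block", "eat_food", "use_item"]                 # 26-28
    + ["craft_planks", "craft_sticks", "craft_crafting_table",
       "craft_wooden_pickaxe", "craft_stone_pickaxe",
       "craft_wooden_sword", "craft_furnace", "craft_torch"]  # 29-36
)
assert len(ACTION_NAMES) == 37
action_space = spaces.Discrete(len(ACTION_NAMES))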

Usage

from huggingface_hub import hf_hub_download
from stable_baselines3 import PPO

# Download the model checkpoint from the Hub
model_path = hf_hub_download(
    repo_id='CahlenLee/minecraft-rl-gathering',
    filename='model.zip',
    local_dir='./models'
)

# Load the trained policy
model = PPO.load(model_path)

# Run inference (env is the custom Minecraft Gymnasium environment)
obs, info = env.reset()
action, _ = model.predict(obs, deterministic=True)
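
For a full evaluation episode, the same predict call runs in a closed loop against the environment. A minimal sketch assuming the standard Gymnasium step API (terminated/truncated flags):

# Minimal evaluation rollout; `env` is assumed to be the custom
# Minecraft Gymnasium wrapper described under Environment Setup.
obs, info = env.reset()
done = False
total_reward = 0.0
while not done:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
print(f"episode reward: {total_reward:.1f}")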

Environment Setup

This model was trained on a custom Minecraft environment using:

  • Mineflayer for bot control
  • Custom Gymnasium wrapper for RL interface
  • LLM-based reward shaping (GPT-OSS-20B via vLLM; sketched after this list)
  • Dense rewards for resource gathering
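
How the reward-shaping call might look: a hedged sketch that queries a vLLM server through its OpenAI-compatible chat endpoint; the host, served model name, prompt, and reply format are all assumptions, not the project's actual protocol:

import requests

# Hedged sketch of LLM reward shaping against a vLLM server's
# OpenAI-compatible endpoint; URL, prompt, and parsing are assumptions.
VLLM_URL = "http://dgx-spark:8000/v1/chat/completions"  # hypothetical host

def llm_shaped_reward(state_summary: str) -> float:
    resp = requests.post(VLLM_URL, json={
        "model": "gpt-oss-20b",  # assumed served model name
        "messages": [{
            "role": "user",
            "content": "Rate this Minecraft gathering progress from "
                       f"-1.0 to 1.0; reply with a number only:\n{state_summary}",
        }],
        "max_tokens": 8,
        "temperature": 0.0,
    }, timeout=10)
    try:
        return float(resp.json()["choices"][0]["message"]["content"].strip())
    except (KeyError, ValueError):
        return 0.0  # fall back to no shaping on a malformed reply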

Training Configuration

from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=256,          # rollout length before each update
    batch_size=256,
    n_epochs=15,          # optimization epochs per rollout
    gamma=0.99,
    gae_lambda=0.95,
    ent_coef=0.02,        # entropy bonus encourages exploration
    clip_range=0.2,
    max_grad_norm=0.5,
    policy_kwargs={"net_arch": {"pi": [512, 512], "vf": [512, 512]}},
)
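
Training is then launched with the standard SB3 call; the step budget and save path below are assumptions, with the budget chosen to roughly match the 230,920 steps reported above:

# Launch training; step budget and save path are assumptions
# (the reported run totaled 230,920 steps).
model.learn(total_timesteps=230_000)
model.save("./models/model")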

Distributed Architecture

┌──────────────┐     HTTP/REST      ┌──────────────────────────┐
│  RTX 5090    │◄──────────────────►│  Jetson Orin AGX         │
│  Training    │   Bot Steps/Obs    │  4x Minecraft Bots       │
│  Server      │                    │  Dashboard (port 3000)   │
└──────┬───────┘                    └──────────────────────────┘
       │
       │ Async LLM Queries
       ▼
┌──────────────┐
│  DGX Spark   │
│  vLLM Server │
│  (20B model) │
└──────────────┘
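
A hedged sketch of the training-side client for the Bot Steps/Obs link; the host, routes, and payload fields are assumptions about the protocol, not the project's actual API:

import numpy as np
import requests

# Hypothetical client for the HTTP/REST link in the diagram above;
# host, routes, and payload fields are assumptions.
JETSON_URL = "http://jetson-orin:8080"  # hypothetical host/port

def remote_reset(bot_id: int) -> np.ndarray:
    r = requests.post(f"{JETSON_URL}/bots/{bot_id}/reset", timeout=30)
    return np.asarray(r.json()["observation"], dtype=np.float32)

def remote_step(bot_id: int, action: int):
    r = requests.post(f"{JETSON_URL}/bots/{bot_id}/step",
                      json={"action": action}, timeout=30)
    data = r.json()
    obs = np.asarray(data["observation"], dtype=np.float32)
    return obs, data["reward"], data["done"]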

License

MIT

Citation

If you use this model, please cite:

@misc{minecraft_rl_gathering,
  author = {Cahlen Humphreys},
  title = {minecraft-rl-gathering},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/cahlen/minecraft-rl-gathering}}
}