minecraft-rl-gathering

A Minecraft RL agent trained with PPO (Proximal Policy Optimization) using Stable-Baselines3.

This agent was trained to gather resources in Minecraft using a distributed training architecture: an RTX 5090 for PPO updates, a Jetson Orin AGX hosting the game environment, and a DGX Spark providing LLM reward shaping.

Training Details

| Metric        | Value     |
|---------------|-----------|
| Total Steps   | 230,920   |
| Episodes      | ~85       |
| Mean Reward   | -141.5    |
| Best Reward   | +50.7    |
| Reward Scheme | gathering |
| Learning Rate | 0.0003    |

Hardware

  • Training: NVIDIA RTX 5090 (32GB VRAM)
  • Environment: NVIDIA Jetson Orin AGX (64GB RAM)
  • LLM Server: NVIDIA DGX Spark - GPT-OSS-20B (vLLM)

Architecture

  • Algorithm: PPO (Proximal Policy Optimization)
  • Policy: MLP with [512, 512] hidden layers
  • Observation Space: 82 dimensions (position, velocity, vitals, hotbar, craftable flags)
  • Action Space: 37 discrete actions (movement, mining, crafting, inventory)

Observation Space (82 dimensions)

| Component       | Dimensions | Description                        |
|-----------------|------------|------------------------------------|
| Position        | 3          | x, y, z normalized                 |
| Velocity        | 3          | vx, vy, vz                         |
| Orientation     | 2          | yaw, pitch normalized              |
| Vitals          | 4          | health, food, saturation, oxygen   |
| Flags           | 2          | is_on_ground, is_day               |
| Time            | 1          | time_of_day normalized             |
| Hotbar          | 18         | 9 slots × (item_type + count)      |
| Held Item       | 3          | type, count, durability            |
| Craftable       | 8          | can_craft flags for key items      |
| Block Grid      | 27         | 3×3×3 nearby blocks                |
| Nearby Entities | 11         | closest entity info                |
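
These component sizes sum to 82 (3+3+2+4+2+1+18+3+8+27+11). As a minimal sketch, the flat observation space could be declared in Gymnasium like this; the component names and unbounded limits are illustrative assumptions, not the repo's actual code:

import numpy as np
from gymnasium import spaces

# Illustrative reconstruction of the 82-dim observation layout;
# component names and bounds are assumptions, not the repo's code.
OBS_COMPONENTS = {
    "position": 3, "velocity": 3, "orientation": 2, "vitals": 4,
    "flags": 2, "time": 1, "hotbar": 18, "held_item": 3,
    "craftable": 8, "block_grid": 27, "nearby_entities": 11,
}
OBS_DIM = sum(OBS_COMPONENTS.values())  # = 82

observation_space = spaces.Box(
    low=-np.inf, high=np.inf, shape=(OBS_DIM,), dtype=np.float32
)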

Action Space (37 discrete actions)

| Category        | Actions |
|-----------------|---------|
| Movement (0-7)  | forward, back, left, right, jump, jump_forward, sprint_forward, forward_long |
| Looking (8-11)  | look_left, look_right, look_up, look_down |
| Mining (12-16)  | mine, attack, mine_forward, mine_up, jump_mine_up |
| Hotbar (17-25)  | select_slot_0 through select_slot_8 |
| Items (26-28)   | place_block, eat_food, use_item |
| Crafting (29-36) | craft_planks, craft_sticks, craft_crafting_table, craft_wooden_pickaxe, craft_stone_pickaxe, craft_wooden_sword, craft_furnace, craft_torch |
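
The matching action space is a single Discrete(37). The index-to-name lookup below is an illustrative sketch built from the table above, not the repo's actual code:

from gymnasium import spaces

# 37 discrete actions in the order given by the table above; the
# flat name list is an illustrative sketch, not the repo's code.
ACTION_NAMES = (
    ["forward", "back", "left", "right", "jump",
     "jump_forward", "sprint_forward", "forward_long"]        # 0-7
    + ["look_left", "look_right", "look_up", "look_down"]     # 8-11
    + ["mine", "attack", "mine_forward", "mine_up",
       "jump_mine_up"]                                        # 12-16
    + [f"select_slot_{i}" for i in range(9)]                  # 17-25
    + ["place_block", "eat_food", "use_item"]                 # 26-28
    + ["craft_planks", "craft_sticks", "craft_crafting_table",
       "craft_wooden_pickaxe", "craft_stone_pickaxe",
       "craft_wooden_sword", "craft_furnace", "craft_torch"]  # 29-36
)
assert len(ACTION_NAMES) == 37
action_space = spaces.Discrete(len(ACTION_NAMES))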

Usage

from huggingface_hub import hf_hub_download
from stable_baselines3 import PPO

# Download the model checkpoint from the Hub
model_path = hf_hub_download(
    repo_id='CahlenLee/minecraft-rl-gathering',
    filename='model.zip',
    local_dir='./models'
)

# Load the trained policy
model = PPO.load(model_path)

# Run inference (env is the custom Minecraft Gymnasium environment)
obs, info = env.reset()
action, _ = model.predict(obs, deterministic=True)
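
For a full evaluation episode, the same predict call runs in a closed loop against the environment. A minimal sketch assuming the standard Gymnasium step API (terminated/truncated flags):

# Minimal evaluation rollout; `env` is assumed to be the custom
# Minecraft Gymnasium wrapper described under Environment Setup.
obs, info = env.reset()
done = False
total_reward = 0.0
while not done:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
print(f"episode reward: {total_reward:.1f}")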

Environment Setup

This model was trained on a custom Minecraft environment using:

  • Mineflayer for bot control
  • Custom Gymnasium wrapper for RL interface
  • LLM-based reward shaping (GPT-OSS-20B via vLLM; sketched after this list)
  • Dense rewards for resource gathering
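
How the reward-shaping call might look: a hedged sketch that queries a vLLM server through its OpenAI-compatible chat endpoint; the host, served model name, prompt, and reply format are all assumptions, not the project's actual protocol:

import requests

# Hedged sketch of LLM reward shaping against a vLLM server's
# OpenAI-compatible endpoint; URL, prompt, and parsing are assumptions.
VLLM_URL = "http://dgx-spark:8000/v1/chat/completions"  # hypothetical host

def llm_shaped_reward(state_summary: str) -> float:
    resp = requests.post(VLLM_URL, json={
        "model": "gpt-oss-20b",  # assumed served model name
        "messages": [{
            "role": "user",
            "content": "Rate this Minecraft gathering progress from "
                       f"-1.0 to 1.0; reply with a number only:\n{state_summary}",
        }],
        "max_tokens": 8,
        "temperature": 0.0,
    }, timeout=10)
    try:
        return float(resp.json()["choices"][0]["message"]["content"].strip())
    except (KeyError, ValueError):
        return 0.0  # fall back to no shaping on a malformed reply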

Training Configuration

from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=256,          # rollout length before each update
    batch_size=256,
    n_epochs=15,          # optimization epochs per rollout
    gamma=0.99,
    gae_lambda=0.95,
    ent_coef=0.02,        # entropy bonus encourages exploration
    clip_range=0.2,
    max_grad_norm=0.5,
    policy_kwargs={"net_arch": {"pi": [512, 512], "vf": [512, 512]}},
)
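
Training is then launched with the standard SB3 call; the step budget and save path below are assumptions, with the budget chosen to roughly match the 230,920 steps reported above:

# Launch training; step budget and save path are assumptions
# (the reported run totaled 230,920 steps).
model.learn(total_timesteps=230_000)
model.save("./models/model")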

Distributed Architecture

┌──────────────┐     HTTP/REST      ┌──────────────────────────┐
│  RTX 5090    │◄──────────────────►│  Jetson Orin AGX         │
│  Training    │   Bot Steps/Obs    │  4x Minecraft Bots       │
│  Server      │                    │  Dashboard (port 3000)   │
└──────┬───────┘                    └──────────────────────────┘
       │
       │ Async LLM Queries
       ▼
┌──────────────┐
│  DGX Spark   │
│  vLLM Server │
│  (20B model) │
└──────────────┘
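
A hedged sketch of the training-side client for the Bot Steps/Obs link; the host, routes, and payload fields are assumptions about the protocol, not the project's actual API:

import numpy as np
import requests

# Hypothetical client for the HTTP/REST link in the diagram above;
# host, routes, and payload fields are assumptions.
JETSON_URL = "http://jetson-orin:8080"  # hypothetical host/port

def remote_reset(bot_id: int) -> np.ndarray:
    r = requests.post(f"{JETSON_URL}/bots/{bot_id}/reset", timeout=30)
    return np.asarray(r.json()["observation"], dtype=np.float32)

def remote_step(bot_id: int, action: int):
    r = requests.post(f"{JETSON_URL}/bots/{bot_id}/step",
                      json={"action": action}, timeout=30)
    data = r.json()
    obs = np.asarray(data["observation"], dtype=np.float32)
    return obs, data["reward"], data["done"]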

License

MIT

Citation

If you use this model, please cite:

@misc{minecraft_rl_gathering,
  author = {Cahlen Humphreys},
  title = {minecraft-rl-gathering},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/cahlen/minecraft-rl-gathering}}
}