Scaling Behavior Cloning Improves Causal Reasoning: An Open Model for Real-Time Video Game Playing
Abstract
Scaling model size and training data improves the performance and causal reasoning of behavior cloning, yielding gameplay competitive with humans in 3D video games.
Behavior cloning is enjoying a resurgence in popularity as scaling both model and data size proves to provide a strong starting point for many tasks of interest. In this work, we introduce an open recipe for training a video-game-playing foundation model designed for real-time inference on a consumer GPU. We release all data (more than 8,300 hours of high-quality human gameplay), training and inference code, and pretrained checkpoints under an open license. We show that our best model can play a variety of 3D video games at a level competitive with human play. We use this recipe to systematically examine the scaling laws of behavior cloning and to understand how the model's performance and causal reasoning vary with model and data scale. We first show, in a simple toy problem, that for some types of causal reasoning, increasing both the amount of training data and the depth of the network leads the model to learn a more causal policy. We then systematically study how causality varies with the number of parameters (and depth) and the number of training steps in scaled models of up to 1.2 billion parameters, and we find scaling behavior similar to what we observe in the toy problem.
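To make the toy problem concrete, here is a minimal, hypothetical sketch (in PyTorch; not the paper's actual benchmark or code) of how a non-causal policy can arise in behavior cloning: a nuisance feature, here the expert's previous action, perfectly tracks the correct action in the training data, so a cloned policy can lean on that shortcut instead of the true cause, and the shortcut breaks at deployment. All names and dimensions below are illustrative assumptions.

```python
# Hypothetical toy setup (not the paper's benchmark): a behavior-cloned policy can
# latch onto a nuisance feature that merely tracks the expert's action in the
# training data (here, the expert's previous action) instead of the true cause.
import torch
import torch.nn as nn

torch.manual_seed(0)
N = 4096
true_cause = torch.randint(0, 2, (N, 1)).float()   # signal the expert actually reacts to
expert_action = true_cause.clone()                  # expert copies the true cause
prev_action = expert_action.clone()                 # shortcut: identical to the action in the data
noise = torch.randn(N, 9) * 0.1
obs = torch.cat([true_cause + noise[:, :1], prev_action, noise[:, 1:]], dim=1)  # (N, 10)

policy = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(policy(obs), expert_action)
    loss.backward()
    opt.step()

# At deployment the shortcut decouples from the true cause; a policy that relied on
# it degrades, while a genuinely causal policy keeps its accuracy.
with torch.no_grad():
    obs_deploy = obs.clone()
    obs_deploy[:, 1] = torch.randint(0, 2, (N,)).float()   # break the correlation
    acc = ((policy(obs_deploy) > 0) == expert_action.bool()).float().mean()
print(f"accuracy once the shortcut is broken: {acc.item():.2f}")
```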
Community
We introduce Pixels2Play (P2P), an open-source generalist agent designed for real-time control across diverse 3D video games on consumer-grade GPUs. Built on an efficient, decoder-only transformer architecture that predicts keyboard and mouse actions from raw pixel inputs, the model is trained on a new dataset of over 8,300 hours of high-quality, text-annotated human gameplay. Beyond achieving human-level competence in commercial environments such as DOOM and Roblox, we systematically investigate the scaling laws of behavior cloning, demonstrating that increasing model and data scale significantly improves causal reasoning and mitigates the "causal confusion" often inherent in imitation learning. To accelerate research in generalist game AI, we are releasing the full training recipe, model checkpoints, and our extensive gameplay dataset under an open license. A sketch of the kind of pixels-to-actions setup this describes follows below.
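The sketch below, assuming PyTorch, illustrates the general shape of such a behavior-cloning policy: a small frame encoder feeds a causally masked transformer, whose per-timestep outputs are decoded into keyboard-press logits and a discretized mouse-motion bin and trained with cross-entropy against the human's actions. The module names, layer sizes, and action discretization are illustrative assumptions, not the released P2P architecture.

```python
# Illustrative sketch only: not the released P2P model. A causal transformer over
# frame embeddings predicts the human's keyboard and (discretized) mouse actions.
import torch
import torch.nn as nn

class FramePolicy(nn.Module):
    def __init__(self, n_keys=16, n_mouse_bins=64, d_model=256, n_layers=4, ctx=32):
        super().__init__()
        # Tiny conv encoder standing in for whatever vision backbone is actually used.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d_model),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.pos = nn.Parameter(torch.zeros(1, ctx, d_model))
        self.key_head = nn.Linear(d_model, n_keys)          # per-key press logits
        self.mouse_head = nn.Linear(d_model, n_mouse_bins)  # discretized mouse motion

    def forward(self, frames):  # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        tokens = self.encoder(frames.flatten(0, 1)).view(b, t, -1) + self.pos[:, :t]
        causal_mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = self.transformer(tokens, mask=causal_mask)      # attend only to past frames
        return self.key_head(h), self.mouse_head(h)

# Behavior-cloning loss on a (dummy) batch of human gameplay: binary cross-entropy
# over key presses plus cross-entropy over the mouse bin, at every timestep.
policy = FramePolicy()
frames = torch.randn(2, 8, 3, 96, 96)
key_targets = torch.randint(0, 2, (2, 8, 16)).float()
mouse_targets = torch.randint(0, 64, (2, 8))
key_logits, mouse_logits = policy(frames)
loss = (nn.functional.binary_cross_entropy_with_logits(key_logits, key_targets)
        + nn.functional.cross_entropy(mouse_logits.flatten(0, 1), mouse_targets.flatten()))
loss.backward()
```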
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API.
- NitroGen: An Open Foundation Model for Generalist Gaming Agents (2026)
- Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model (2025)
- ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models (2025)
- T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground (2025)
- Scalable Policy Evaluation with Video World Models (2025)
- Fast and Accurate Causal Parallel Decoding using Jacobi Forcing (2025)
- Falcon-H1R: Pushing the Reasoning Frontiers with a Hybrid Model for Efficient Test-Time Scaling (2026)