StarVLA/Florence-GR00T-Bridge-RT-1


BridgeRT Benchmark Evaluation Results

Performance Summary

Training Steps	PutCarrotOnPlateInScene	PutEggplantInBasketScene	PutSpoonOnTableClothInScene	StackGreenCubeOnYellowCubeBakedTexInScene	Average Across Tasks
10,000	0.2083	0.8750	0.8750	0.0417	0.5000
20,000	0.4063	0.6146	0.6979	0.1354	0.4635
30,000	0.4896	0.7813	0.3854	0.1771	0.4583
40,000	0.4688	0.7500	0.3854	0.1458	0.4375
50,000	0.3750	0.5521	0.5000	0.0938	0.3802
60,000	0.3646	0.6042	0.4583	0.1042	0.3828
70,000	0.3958	0.7813	0.4688	0.1250	0.4427
80,000	0.4271	0.8125	0.5208	0.0521	0.4531
90,000	0.4063	0.5313	0.5833	0.1250	0.4115
100,000	0.4479	0.5938	0.7500	0.1042	0.4740
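The "Average Across Tasks" column appears to be the unweighted mean of the four per-task success rates. The following is a minimal sketch that recomputes it from the table; the names `RESULTS`, `REPORTED`, and `average_across_tasks` are illustrative and are not part of any StarVLA tooling.

```python
# Per-task success rates copied from the table above, keyed by training step.
# Tuple order: PutCarrotOnPlateInScene, PutEggplantInBasketScene,
# PutSpoonOnTableClothInScene, StackGreenCubeOnYellowCubeBakedTexInScene.
RESULTS = {
    10_000: (0.2083, 0.8750, 0.8750, 0.0417),
    20_000: (0.4063, 0.6146, 0.6979, 0.1354),
    30_000: (0.4896, 0.7813, 0.3854, 0.1771),
    40_000: (0.4688, 0.7500, 0.3854, 0.1458),
    50_000: (0.3750, 0.5521, 0.5000, 0.0938),
    60_000: (0.3646, 0.6042, 0.4583, 0.1042),
    70_000: (0.3958, 0.7813, 0.4688, 0.1250),
    80_000: (0.4271, 0.8125, 0.5208, 0.0521),
    90_000: (0.4063, 0.5313, 0.5833, 0.1250),
    100_000: (0.4479, 0.5938, 0.7500, 0.1042),
}

# The "Average Across Tasks" column as reported in the table.
REPORTED = {
    10_000: 0.5000, 20_000: 0.4635, 30_000: 0.4583, 40_000: 0.4375,
    50_000: 0.3802, 60_000: 0.3828, 70_000: 0.4427, 80_000: 0.4531,
    90_000: 0.4115, 100_000: 0.4740,
}


def average_across_tasks(scores):
    """Unweighted mean of the per-task success rates (assumed definition)."""
    return sum(scores) / len(scores)


averages = {step: average_across_tasks(s) for step, s in RESULTS.items()}

for step, avg in averages.items():
    # Differences below 1e-4 are rounding in the four-decimal table values.
    diff = abs(avg - REPORTED[step])
    print(f"{step:>7,}  recomputed={avg:.5f}  reported={REPORTED[step]:.4f}  diff={diff:.5f}")
```

Under this assumption, every recomputed average agrees with the reported column to within the table's four-decimal rounding.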

Performance Analysis

📊 Task Difficulty Ranking (by success rate at 100k steps, easiest first)

PutSpoonOnTableClothInScene: 0.7500 (easiest at 100k)
PutEggplantInBasketScene: 0.5938
PutCarrotOnPlateInScene: 0.4479
StackGreenCubeOnYellowCubeBakedTexInScene: 0.1042 (most challenging)

🏆 Best Performance by Task:

PutCarrotOnPlateInScene: 0.4896 (at 30k)
PutEggplantInBasketScene: 0.8750 (at 10k) ⚠️ Early peak, possible overfitting
PutSpoonOnTableClothInScene: 0.8750 (at 10k) ⚠️ Early peak, possible overfitting
StackGreenCubeOnYellowCubeBakedTexInScene: 0.1771 (at 30k)
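The per-task best checkpoints listed above can be reproduced with a simple argmax over the table. This is a sketch under the assumption that the table values are the only inputs; `TASKS`, `RESULTS`, and `best_checkpoint` are hypothetical names introduced for illustration.

```python
TASKS = (
    "PutCarrotOnPlateInScene",
    "PutEggplantInBasketScene",
    "PutSpoonOnTableClothInScene",
    "StackGreenCubeOnYellowCubeBakedTexInScene",
)

# Success rates from the Performance Summary table, in TASKS order.
RESULTS = {
    10_000: (0.2083, 0.8750, 0.8750, 0.0417),
    20_000: (0.4063, 0.6146, 0.6979, 0.1354),
    30_000: (0.4896, 0.7813, 0.3854, 0.1771),
    40_000: (0.4688, 0.7500, 0.3854, 0.1458),
    50_000: (0.3750, 0.5521, 0.5000, 0.0938),
    60_000: (0.3646, 0.6042, 0.4583, 0.1042),
    70_000: (0.3958, 0.7813, 0.4688, 0.1250),
    80_000: (0.4271, 0.8125, 0.5208, 0.0521),
    90_000: (0.4063, 0.5313, 0.5833, 0.1250),
    100_000: (0.4479, 0.5938, 0.7500, 0.1042),
}


def best_checkpoint(task):
    """Return (step, score) for the checkpoint with the highest score on `task`.

    max() keeps the first maximum it sees, so ties go to the earliest step.
    """
    i = TASKS.index(task)
    step = max(RESULTS, key=lambda s: RESULTS[s][i])
    return step, RESULTS[step][i]


for task in TASKS:
    step, score = best_checkpoint(task)
    print(f"{task}: {score:.4f} (at {step // 1000}k)")
```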

📈 Training Progression Analysis:

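Early Training (10k steps): Strong start on PutEggplantInBasketScene and PutSpoonOnTableClothInScene (0.8750 each), but the weakest checkpoint for PutCarrotOnPlateInScene (0.2083) and StackGreenCubeOnYellowCubeBakedTexInScene (0.0417)
Mid Training (20k-40k steps): Most balanced performance across tasks; averages stay between 0.4375 and 0.4635, and PutCarrotOnPlateInScene and StackGreenCubeOnYellowCubeBakedTexInScene reach their best scores at 30k
Late Training (50k-100k steps): Performance fluctuates (averages range from 0.3802 to 0.4740) without a steady trend, which may indicate overfitting on some tasks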

🔍 Key Observations:

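StackGreenCubeOnYellowCubeBakedTexInScene is extremely challenging: its success rate never exceeds 0.1771
Early peak: PutEggplantInBasketScene and PutSpoonOnTableClothInScene score best at 10k steps, then degrade (possible early overfitting)
No clear convergence: The average does not improve consistently with more training; it falls from 0.5000 at 10k to 0.3802 at 50k, then recovers unevenly to 0.4740 at 100k
Highest average: 10,000 steps (0.5000), but this is driven by two tasks and the checkpoint is likely overfitted to them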

Recommendations

For Deployment:

Conservative choice: 20,000 steps (0.4635 average, no task below 0.1354)
Balanced choice: 30,000 steps (0.4583 average, best PutCarrotOnPlateInScene and StackGreenCubeOnYellowCubeBakedTexInScene performance)
Avoid: 10,000 steps despite the highest average (likely overfitted to PutEggplantInBasketScene and PutSpoonOnTableClothInScene)
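One way to make the trade-off behind these recommendations explicit is to require a minimum success rate on the weakest task, then pick the highest average among the checkpoints that pass. This is only an illustrative sketch: the `floor` threshold and the `select_checkpoint` helper are assumptions introduced here, not a selection rule stated by the model authors.

```python
# Success rates from the Performance Summary table.
# Tuple order: carrot, eggplant, spoon, stack-cube.
RESULTS = {
    10_000: (0.2083, 0.8750, 0.8750, 0.0417),
    20_000: (0.4063, 0.6146, 0.6979, 0.1354),
    30_000: (0.4896, 0.7813, 0.3854, 0.1771),
    40_000: (0.4688, 0.7500, 0.3854, 0.1458),
    50_000: (0.3750, 0.5521, 0.5000, 0.0938),
    60_000: (0.3646, 0.6042, 0.4583, 0.1042),
    70_000: (0.3958, 0.7813, 0.4688, 0.1250),
    80_000: (0.4271, 0.8125, 0.5208, 0.0521),
    90_000: (0.4063, 0.5313, 0.5833, 0.1250),
    100_000: (0.4479, 0.5938, 0.7500, 0.1042),
}


def select_checkpoint(floor):
    """Pick the step with the highest mean among checkpoints whose worst
    task scores at least `floor` (a hypothetical threshold).

    Returns None when no checkpoint meets the floor.
    """
    eligible = [step for step, scores in RESULTS.items() if min(scores) >= floor]
    if not eligible:
        return None
    return max(eligible, key=lambda step: sum(RESULTS[step]) / len(RESULTS[step]))


for floor in (0.0, 0.10, 0.13, 0.15):
    print(f"floor={floor:.2f} -> {select_checkpoint(floor)}")
```

With no floor, this rule returns the 10,000-step checkpoint. A floor of 0.13 returns 20,000 steps and a floor of 0.15 returns 30,000 steps, matching the two recommended choices. A looser floor of 0.10 returns 100,000 steps instead, so the outcome is sensitive to the threshold chosen.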
