BridgeRT Benchmark Evaluation Results
Performance Summary
| Training Steps | PutCarrotOnPlateInScene | PutEggplantInBasketScene | PutSpoonOnTableClothInScene | StackGreenCubeOnYellowCubeBakedTexInScene | Average Across Tasks |
|---|---|---|---|---|---|
| 10,000 | 0.2083 | 0.8750 | 0.8750 | 0.0417 | 0.5000 |
| 20,000 | 0.4063 | 0.6146 | 0.6979 | 0.1354 | 0.4635 |
| 30,000 | 0.4896 | 0.7813 | 0.3854 | 0.1771 | 0.4583 |
| 40,000 | 0.4688 | 0.7500 | 0.3854 | 0.1458 | 0.4375 |
| 50,000 | 0.3750 | 0.5521 | 0.5000 | 0.0938 | 0.3802 |
| 60,000 | 0.3646 | 0.6042 | 0.4583 | 0.1042 | 0.3828 |
| 70,000 | 0.3958 | 0.7813 | 0.4688 | 0.1250 | 0.4427 |
| 80,000 | 0.4271 | 0.8125 | 0.5208 | 0.0521 | 0.4531 |
| 90,000 | 0.4063 | 0.5313 | 0.5833 | 0.1250 | 0.4115 |
| 100,000 | 0.4479 | 0.5938 | 0.7500 | 0.1042 | 0.4740 |
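The Average Across Tasks column appears to be the unweighted mean of the four per-task success rates. A minimal sketch that recomputes it from the table, assuming that definition (the names `ROWS` and `recompute_averages` are illustrative and not part of any released evaluation code):

```python
# Success rates copied from the table above.
# Tuple order: Carrot, Eggplant, Spoon, StackGreenCube, reported average.
ROWS = {
    10_000: (0.2083, 0.8750, 0.8750, 0.0417, 0.5000),
    20_000: (0.4063, 0.6146, 0.6979, 0.1354, 0.4635),
    30_000: (0.4896, 0.7813, 0.3854, 0.1771, 0.4583),
    40_000: (0.4688, 0.7500, 0.3854, 0.1458, 0.4375),
    50_000: (0.3750, 0.5521, 0.5000, 0.0938, 0.3802),
    60_000: (0.3646, 0.6042, 0.4583, 0.1042, 0.3828),
    70_000: (0.3958, 0.7813, 0.4688, 0.1250, 0.4427),
    80_000: (0.4271, 0.8125, 0.5208, 0.0521, 0.4531),
    90_000: (0.4063, 0.5313, 0.5833, 0.1250, 0.4115),
    100_000: (0.4479, 0.5938, 0.7500, 0.1042, 0.4740),
}


def recompute_averages(rows):
    """Return {steps: unweighted mean of the four per-task scores}."""
    return {steps: sum(vals[:4]) / 4 for steps, vals in rows.items()}


averages = recompute_averages(ROWS)
for steps, avg in averages.items():
    reported = ROWS[steps][4]
    # The inputs are rounded to 4 decimals, so allow a small tolerance.
    status = "ok" if abs(avg - reported) < 1e-4 else "MISMATCH"
    print(f"{steps:>7,}  recomputed={avg:.4f}  reported={reported:.4f}  {status}")
```

Because the per-task values are already rounded, the recomputed mean can differ from the reported one in the last digit, which is why the check uses a tolerance rather than exact equality.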
Performance Analysis
Task Difficulty Ranking (by Final Scores)
- PutSpoonOnTableClothInScene: 0.7500 (highest at 100k)
- PutEggplantInBasketScene: 0.5938 (at 100k)
- PutCarrotOnPlateInScene: 0.4479 (at 100k)
- StackGreenCubeOnYellowCubeBakedTexInScene: 0.1042 (at 100k) - Most challenging
Best Performance by Task:
- PutCarrotOnPlateInScene: 0.4896 (at 30k)
- PutEggplantInBasketScene: 0.8750 (at 10k) ⚠️ Early overfitting
- PutSpoonOnTableClothInScene: 0.8750 (at 10k) ⚠️ Early overfitting
- StackGreenCubeOnYellowCubeBakedTexInScene: 0.1771 (at 30k)
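Both the final-score ranking and the per-task best checkpoints can be derived directly from the Performance Summary table. A small sketch, with the scores copied from that table and with hypothetical helper names (`final_ranking`, `best_checkpoint_per_task`) chosen only for illustration:

```python
TASKS = (
    "PutCarrotOnPlateInScene",
    "PutEggplantInBasketScene",
    "PutSpoonOnTableClothInScene",
    "StackGreenCubeOnYellowCubeBakedTexInScene",
)

# steps -> success rate per task, in the same order as TASKS.
SCORES = {
    10_000: (0.2083, 0.8750, 0.8750, 0.0417),
    20_000: (0.4063, 0.6146, 0.6979, 0.1354),
    30_000: (0.4896, 0.7813, 0.3854, 0.1771),
    40_000: (0.4688, 0.7500, 0.3854, 0.1458),
    50_000: (0.3750, 0.5521, 0.5000, 0.0938),
    60_000: (0.3646, 0.6042, 0.4583, 0.1042),
    70_000: (0.3958, 0.7813, 0.4688, 0.1250),
    80_000: (0.4271, 0.8125, 0.5208, 0.0521),
    90_000: (0.4063, 0.5313, 0.5833, 0.1250),
    100_000: (0.4479, 0.5938, 0.7500, 0.1042),
}


def final_ranking(scores, tasks):
    """Tasks sorted by success rate at the last checkpoint, highest first."""
    last = scores[max(scores)]
    return sorted(zip(tasks, last), key=lambda pair: pair[1], reverse=True)


def best_checkpoint_per_task(scores, tasks):
    """Map each task to (steps, score) of its highest-scoring checkpoint."""
    best = {}
    for i, task in enumerate(tasks):
        steps = max(scores, key=lambda s: scores[s][i])
        best[task] = (steps, scores[steps][i])
    return best


ranking = final_ranking(SCORES, TASKS)
best = best_checkpoint_per_task(SCORES, TASKS)

for task, score in ranking:
    print(f"final  {task}: {score:.4f}")
for task, (steps, score) in best.items():
    print(f"best   {task}: {score:.4f} at {steps:,} steps")
```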
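Training Progression Analysis:
- Early Training (10k steps): Strong start on PutEggplantInBasketScene and PutSpoonOnTableClothInScene (0.8750 each) but poor on PutCarrotOnPlateInScene (0.2083) and StackGreenCubeOnYellowCubeBakedTexInScene (0.0417)
- Mid Training (20k-40k steps): Most balanced performance across tasks; PutCarrotOnPlateInScene and StackGreenCubeOnYellowCubeBakedTexInScene both peak at 30k
- Late Training (50k-100k steps): The average fluctuates between 0.3802 and 0.4740 without a steady trend, showing signs of overfitting on some tasks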
Key Observations:
- StackGreenCubeOnYellowCubeBakedTexInScene is extremely challenging (max 0.1771 success rate)
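- Early overfitting: PutEggplantInBasketScene and PutSpoonOnTableClothInScene peak at 10k steps (0.8750 each) and never return to that level
- No clear convergence: The average does not improve consistently with more training; it falls to 0.3802 at 50k and recovers only to 0.4740 at 100k
- Best average: The 10,000-step checkpoint scores highest (0.5000) but is likely overfitted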
Recommendations
For Deployment:
- Conservative choice: 20,000 steps (0.4635 average, more stable performance)
- Balanced choice: 30,000 steps (0.4583 average, best StackGreenCubeOnYellowCubeBakedTexInScene score at 0.1771)
- Avoid: 10,000 steps despite high average (likely overfitted to specific tasks)
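One way to make the trade-off behind these recommendations explicit is to list, for each checkpoint, both the mean and the worst per-task score. The sketch below does this with scores from the Performance Summary table. Using the minimum task score as a balance criterion is an assumption made for illustration, not the method used to produce the recommendations, and `summarize` is a hypothetical helper name.

```python
# steps -> success rates for (Carrot, Eggplant, Spoon, StackGreenCube).
SCORES = {
    10_000: (0.2083, 0.8750, 0.8750, 0.0417),
    20_000: (0.4063, 0.6146, 0.6979, 0.1354),
    30_000: (0.4896, 0.7813, 0.3854, 0.1771),
    40_000: (0.4688, 0.7500, 0.3854, 0.1458),
    50_000: (0.3750, 0.5521, 0.5000, 0.0938),
    60_000: (0.3646, 0.6042, 0.4583, 0.1042),
    70_000: (0.3958, 0.7813, 0.4688, 0.1250),
    80_000: (0.4271, 0.8125, 0.5208, 0.0521),
    90_000: (0.4063, 0.5313, 0.5833, 0.1250),
    100_000: (0.4479, 0.5938, 0.7500, 0.1042),
}


def summarize(scores):
    """Return {steps: (mean score, worst task score)} for every checkpoint."""
    return {s: (sum(v) / len(v), min(v)) for s, v in scores.items()}


summary = summarize(SCORES)

# Checkpoint with the highest mean, and the one with the highest worst-task score.
best_avg = max(summary, key=lambda s: summary[s][0])
best_worst_case = max(summary, key=lambda s: summary[s][1])

for steps, (mean, worst) in summary.items():
    print(f"{steps:>7,}  mean={mean:.4f}  worst_task={worst:.4f}")
print("highest mean:", best_avg)
print("highest worst-task score:", best_worst_case)
```

Under this assumed criterion, the 30,000-step checkpoint has the highest worst-task score (0.1771), while the 10,000-step checkpoint has the highest mean (0.5000) and the lowest worst-task score (0.0417).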