BridgeRT Benchmark Evaluation Results

Performance Summary

| Training Steps | PutCarrotOnPlateInScene | PutEggplantInBasketScene | PutSpoonOnTableClothInScene | StackGreenCubeOnYellowCubeBakedTexInScene | Average Across Tasks |
|---|---|---|---|---|---|
| 10,000 | 0.2083 | 0.8750 | 0.8750 | 0.0417 | 0.5000 |
| 20,000 | 0.4063 | 0.6146 | 0.6979 | 0.1354 | 0.4635 |
| 30,000 | 0.4896 | 0.7813 | 0.3854 | 0.1771 | 0.4583 |
| 40,000 | 0.4688 | 0.7500 | 0.3854 | 0.1458 | 0.4375 |
| 50,000 | 0.3750 | 0.5521 | 0.5000 | 0.0938 | 0.3802 |
| 60,000 | 0.3646 | 0.6042 | 0.4583 | 0.1042 | 0.3828 |
| 70,000 | 0.3958 | 0.7813 | 0.4688 | 0.1250 | 0.4427 |
| 80,000 | 0.4271 | 0.8125 | 0.5208 | 0.0521 | 0.4531 |
| 90,000 | 0.4063 | 0.5313 | 0.5833 | 0.1250 | 0.4115 |
| 100,000 | 0.4479 | 0.5938 | 0.7500 | 0.1042 | 0.4740 |
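
The "Average Across Tasks" column appears to be the unweighted mean of the four per-task success rates. The sketch below recomputes it from the table values to check that reading; the names `SCORES`, `REPORTED_AVG`, and `task_average` are illustrative and not part of any evaluation script for this model.

```python
# Per-task success rates copied from the table above, keyed by training step.
# Column order: carrot, eggplant, spoon, stack-cube.
SCORES = {
    10_000: (0.2083, 0.8750, 0.8750, 0.0417),
    20_000: (0.4063, 0.6146, 0.6979, 0.1354),
    30_000: (0.4896, 0.7813, 0.3854, 0.1771),
    40_000: (0.4688, 0.7500, 0.3854, 0.1458),
    50_000: (0.3750, 0.5521, 0.5000, 0.0938),
    60_000: (0.3646, 0.6042, 0.4583, 0.1042),
    70_000: (0.3958, 0.7813, 0.4688, 0.1250),
    80_000: (0.4271, 0.8125, 0.5208, 0.0521),
    90_000: (0.4063, 0.5313, 0.5833, 0.1250),
    100_000: (0.4479, 0.5938, 0.7500, 0.1042),
}

# "Average Across Tasks" column as reported in the table.
REPORTED_AVG = {
    10_000: 0.5000, 20_000: 0.4635, 30_000: 0.4583, 40_000: 0.4375,
    50_000: 0.3802, 60_000: 0.3828, 70_000: 0.4427, 80_000: 0.4531,
    90_000: 0.4115, 100_000: 0.4740,
}


def task_average(step):
    """Unweighted mean of the four task success rates at one checkpoint."""
    row = SCORES[step]
    return sum(row) / len(row)


for step in sorted(SCORES):
    print(f"{step:>7,}  computed={task_average(step):.4f}  "
          f"reported={REPORTED_AVG[step]:.4f}")
```

Every computed mean agrees with the reported column to within the rounding of the four-decimal inputs.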

Performance Analysis

📊 Task Difficulty Ranking (by Final Scores)

  1. PutSpoonOnTableClothInScene: 0.7500 (highest at 100k)
  2. PutEggplantInBasketScene: 0.5938 (at 100k)
  3. PutCarrotOnPlateInScene: 0.4479 (at 100k)
  4. StackGreenCubeOnYellowCubeBakedTexInScene: 0.1042 (at 100k) - Most challenging

🏆 Best Performance by Task:

  • PutCarrotOnPlateInScene: 0.4896 (at 30k)
  • PutEggplantInBasketScene: 0.8750 (at 10k) ⚠️ Early overfitting
  • PutSpoonOnTableClothInScene: 0.8750 (at 10k) ⚠️ Early overfitting
  • StackGreenCubeOnYellowCubeBakedTexInScene: 0.1771 (at 30k)
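
The per-task bests listed above can be reproduced by taking, for each task, the checkpoint with the highest success rate. This is a minimal sketch over the table values; `TASKS`, `SCORES`, and `best_checkpoint` are hypothetical names chosen for illustration.

```python
TASKS = (
    "PutCarrotOnPlateInScene",
    "PutEggplantInBasketScene",
    "PutSpoonOnTableClothInScene",
    "StackGreenCubeOnYellowCubeBakedTexInScene",
)

# Success rates from the results table, one tuple per checkpoint,
# in the same order as TASKS.
SCORES = {
    10_000: (0.2083, 0.8750, 0.8750, 0.0417),
    20_000: (0.4063, 0.6146, 0.6979, 0.1354),
    30_000: (0.4896, 0.7813, 0.3854, 0.1771),
    40_000: (0.4688, 0.7500, 0.3854, 0.1458),
    50_000: (0.3750, 0.5521, 0.5000, 0.0938),
    60_000: (0.3646, 0.6042, 0.4583, 0.1042),
    70_000: (0.3958, 0.7813, 0.4688, 0.1250),
    80_000: (0.4271, 0.8125, 0.5208, 0.0521),
    90_000: (0.4063, 0.5313, 0.5833, 0.1250),
    100_000: (0.4479, 0.5938, 0.7500, 0.1042),
}


def best_checkpoint(task_index):
    """Return (step, score) for the best checkpoint of one task.

    Steps are scanned in ascending order, so a tie goes to the earliest step.
    """
    step = max(sorted(SCORES), key=lambda s: SCORES[s][task_index])
    return step, SCORES[step][task_index]


for i, task in enumerate(TASKS):
    step, score = best_checkpoint(i)
    print(f"{task}: {score:.4f} (at {step // 1000}k)")
```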

📈 Training Progression Analysis:

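  1. Early Training (10k steps): Strong start on PutEggplantInBasketScene and PutSpoonOnTableClothInScene (0.8750 each) but poor on the other two tasks (0.2083 and 0.0417)
  2. Mid Training (20k-40k steps): Most balanced performance across tasks; the two hardest tasks, PutCarrotOnPlateInScene and StackGreenCubeOnYellowCubeBakedTexInScene, both peak at 30k
  3. Late Training (50k-100k steps): Performance fluctuates, with the average dipping to 0.3802 at 50k before recovering to 0.4740 at 100k, which may indicate overfitting on some tasks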
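
To compare the three phases numerically, one option is to average the "Average Across Tasks" column within each phase. The sketch below does that with the reported values; the phase boundaries follow the list above, while `AVERAGE`, `PHASES`, and `phase_mean` are illustrative names, and averaging across checkpoints is an assumption rather than something the evaluation itself reports.

```python
# "Average Across Tasks" per checkpoint, copied from the results table.
AVERAGE = {
    10_000: 0.5000, 20_000: 0.4635, 30_000: 0.4583, 40_000: 0.4375,
    50_000: 0.3802, 60_000: 0.3828, 70_000: 0.4427, 80_000: 0.4531,
    90_000: 0.4115, 100_000: 0.4740,
}

# Inclusive step ranges for each phase named in the progression analysis.
PHASES = {
    "early": (10_000, 10_000),
    "mid": (20_000, 40_000),
    "late": (50_000, 100_000),
}


def phase_mean(lo, hi):
    """Mean of the per-checkpoint averages for steps in [lo, hi]."""
    values = [avg for step, avg in AVERAGE.items() if lo <= step <= hi]
    return sum(values) / len(values)


for name, (lo, hi) in PHASES.items():
    print(f"{name:>5}: {phase_mean(lo, hi):.4f}")
```

With the table values this gives roughly 0.500 (early), 0.453 (mid), and 0.424 (late), so the phase-level trend is a mild decline rather than an improvement.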

🔍 Key Observations:

  1. StackGreenCubeOnYellowCubeBakedTexInScene is extremely challenging: its success rate never exceeds 0.1771 (at 30k)
  2. Early overfitting: PutEggplantInBasketScene and PutSpoonOnTableClothInScene score best at 10k steps (0.8750 each), then degrade
  3. No clear convergence: The average does not improve consistently with more training, ranging from 0.3802 (50k) to 0.5000 (10k)
  4. Highest average: 10,000 steps (0.5000), but that checkpoint is likely overfitted to two tasks

Recommendations

For Deployment:

  • Conservative choice: 20,000 steps (0.4635 average, no task below 0.1354)
  • Balanced choice: 30,000 steps (0.4583 average, best PutCarrotOnPlateInScene and StackGreenCubeOnYellowCubeBakedTexInScene performance)
  • Avoid: 10,000 steps despite the highest average (0.5000); it is likely overfitted to PutEggplantInBasketScene and PutSpoonOnTableClothInScene
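
One way to make the trade-off behind these recommendations explicit is to rank checkpoints by their weakest task rather than by their average. The sketch below uses that worst-task criterion on the table values; the criterion itself is an assumption made for illustration, not a rule stated in this card, and `SCORES` and `worst_task_score` are hypothetical names.

```python
# Success rates from the results table.
# Column order: carrot, eggplant, spoon, stack-cube.
SCORES = {
    10_000: (0.2083, 0.8750, 0.8750, 0.0417),
    20_000: (0.4063, 0.6146, 0.6979, 0.1354),
    30_000: (0.4896, 0.7813, 0.3854, 0.1771),
    40_000: (0.4688, 0.7500, 0.3854, 0.1458),
    50_000: (0.3750, 0.5521, 0.5000, 0.0938),
    60_000: (0.3646, 0.6042, 0.4583, 0.1042),
    70_000: (0.3958, 0.7813, 0.4688, 0.1250),
    80_000: (0.4271, 0.8125, 0.5208, 0.0521),
    90_000: (0.4063, 0.5313, 0.5833, 0.1250),
    100_000: (0.4479, 0.5938, 0.7500, 0.1042),
}


def worst_task_score(step):
    """Lowest per-task success rate at a checkpoint."""
    return min(SCORES[step])


# Checkpoints ordered from best to worst weakest-task score.
ranked = sorted(SCORES, key=worst_task_score, reverse=True)

for step in ranked:
    avg = sum(SCORES[step]) / len(SCORES[step])
    print(f"{step:>7,}  worst={worst_task_score(step):.4f}  avg={avg:.4f}")
```

Under this criterion the 30k, 40k, and 20k checkpoints come out on top and the 10k checkpoint comes last, which is consistent with the advice above even though 10k has the highest average.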