Encoder Winners Do Not Reliably Transfer Across VLA Backbone Scale: A Frozen-Backbone Grafting Diagnostic
Frozen-backbone grafting is a new diagnostic for evaluating vision encoders in vision-language-action (VLA) policies. It tests whether an encoder that performs well on a smaller VLA backbone also performs well on a larger one. Experiments across different encoders, VLA suites, and backbones (SmolVLA-450M and $\pi_{0.5}$-3.3B) show that the best encoder often depends on both the backbone scale and the task suite, so small-backbone validation does not reliably predict large-backbone performance. The researchers propose the diagnostic as a cost-effective tool for selecting encoders before scaling up.
IMPACT: Highlights the need for backbone-specific encoder selection in VLA policies, suggesting current small-scale validation may not translate to larger models.
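
The transfer question in the summary above (does the winning encoder on the small backbone also win on the large one?) can be illustrated with a minimal sketch. This is not the researchers' implementation and does not perform the grafting itself. It only shows the comparison step, assuming per-suite success rates are already available for each encoder on each backbone. The function names, suite names, encoder names, and all numbers below are hypothetical placeholders, not results from the study.

```python
from typing import Dict

# Hypothetical layout: suite name -> encoder name -> success rate in [0, 1].
Scores = Dict[str, Dict[str, float]]


def best_encoder(per_encoder: Dict[str, float]) -> str:
    """Return the encoder with the highest score for one suite."""
    return max(per_encoder, key=per_encoder.get)


def winner_transfer(small: Scores, large: Scores) -> Dict[str, bool]:
    """For each suite present in both tables, report whether the
    small-backbone winner is also the large-backbone winner."""
    return {
        suite: best_encoder(small[suite]) == best_encoder(large[suite])
        for suite in small
        if suite in large
    }


# Placeholder numbers for illustration only (NOT measured results).
small_backbone: Scores = {
    "suite_a": {"enc_x": 0.62, "enc_y": 0.55},
    "suite_b": {"enc_x": 0.40, "enc_y": 0.48},
}
large_backbone: Scores = {
    "suite_a": {"enc_x": 0.71, "enc_y": 0.78},
    "suite_b": {"enc_x": 0.52, "enc_y": 0.60},
}

result = winner_transfer(small_backbone, large_backbone)
print(result)
```

With these placeholder values the winner changes on `suite_a` and stays the same on `suite_b`, which is the kind of suite-dependent disagreement the summary describes. A fuller analysis would likely compare whole rankings and account for run-to-run variance instead of only the top encoder.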