Brief · PulseAugur

RESEARCH · arXiv cs.CV English(EN) · 4d · [2 sources]

Encoder Winners Do Not Reliably Transfer Across VLA Backbone Scale: A Frozen-Backbone Grafting Diagnostic

A new diagnostic, frozen-backbone grafting, evaluates vision encoders for vision-language-action (VLA) policies by testing whether an encoder that performs well on a smaller VLA backbone also performs well on a larger one. Experiments across several encoders, VLA task suites, and two backbones (SmolVLA-450M and $\pi_{0.5}$-3.3B) found that the best encoder often depends on both backbone scale and task suite, so validation on a small backbone does not reliably predict performance on a large one. The researchers propose the diagnostic as a cost-effective way to select encoders before scaling up.
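
The transfer question can be made concrete with a short sketch. This is only an illustration of the comparison described above, not the authors' implementation: the function names, the encoder and suite labels, and every score are hypothetical placeholders, and none of the numbers come from the paper.

```python
from typing import Dict, Tuple

# (encoder, suite, backbone) -> success rate.
# All names and values below are made-up placeholders, NOT results from the paper.
Scores = Dict[Tuple[str, str, str], float]

PLACEHOLDER_SCORES: Scores = {
    ("encoder_a", "suite_x", "small"): 0.6,
    ("encoder_b", "suite_x", "small"): 0.5,
    ("encoder_a", "suite_x", "large"): 0.7,
    ("encoder_b", "suite_x", "large"): 0.8,
    ("encoder_a", "suite_y", "small"): 0.4,
    ("encoder_b", "suite_y", "small"): 0.3,
    ("encoder_a", "suite_y", "large"): 0.9,
    ("encoder_b", "suite_y", "large"): 0.5,
}


def winner(scores: Scores, suite: str, backbone: str) -> str:
    """Return the encoder with the highest score for one suite and backbone."""
    candidates = {
        enc: score
        for (enc, su, bb), score in scores.items()
        if su == suite and bb == backbone
    }
    if not candidates:
        raise ValueError(f"no scores for suite={suite!r}, backbone={backbone!r}")
    # Ties resolve to whichever encoder appears first in the dict.
    return max(candidates, key=candidates.get)


def winner_transfers(scores: Scores, suite: str, small: str, large: str) -> bool:
    """True if the small-backbone winner is also the large-backbone winner."""
    return winner(scores, suite, small) == winner(scores, suite, large)


if __name__ == "__main__":
    for suite in ("suite_x", "suite_y"):
        ok = winner_transfers(PLACEHOLDER_SCORES, suite, "small", "large")
        print(suite, "winner transfers:", ok)
```

With these placeholder values the winner changes between backbones on one suite and stays the same on the other, which is the kind of suite- and scale-dependent outcome the brief describes. A fuller check would also account for run-to-run variance before declaring a winner.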

IMPACT Highlights the need for backbone-specific encoder selection in VLA policies, suggesting that encoder choices validated on small backbones may not carry over to larger ones.

SigLIP
LIBERO
DINOv2-small
SmolVLA-450M
$\pi_{0.5}$-3.3B