New protocol measures commonsense knowledge in VLA models

By PulseAugur Editorial · [1 source] · 2026-06-17 17:20

Researchers have developed Act2Answer, an evaluation protocol that measures how much commonsense and world knowledge Vision-Language-Action (VLA) models retain after fine-tuning on robotics data. The protocol adapts existing VLM knowledge benchmarks so that an agent selects its answer by performing a specific action in a tabletop environment, which reduces confounds from low-level control. A large-scale study of seven VLA models and nine VLM baselines found that VLA models perform well on simple concepts but show larger knowledge gaps than their source VLMs in complex semantic areas. The study also indicated that VQA co-training aids knowledge retention and that knowledge-relevant signals are strongest in the middle layers of VLA models.
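The answer-by-action idea can be illustrated with a minimal sketch. This is not the paper's implementation: the `Question` class, the `decode_answer` and `accuracy` helpers, the nearest-target decoding rule, and the example coordinates are all hypothetical, assumed here only to show how a multiple-choice item might be scored from where an agent's action ends up on the table.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Question:
    """Hypothetical multiple-choice item laid out on a tabletop."""
    prompt: str
    # Option label -> assumed (x, y) position of that option's target on the table.
    options: dict[str, tuple[float, float]]
    correct: str


def decode_answer(question: Question, final_xy: tuple[float, float]) -> str:
    """Read the chosen option as the target nearest to the agent's final position.

    Nearest-target decoding is an assumption made for this sketch.
    """
    return min(question.options, key=lambda label: dist(question.options[label], final_xy))


def accuracy(questions: list[Question], final_positions: list[tuple[float, float]]) -> float:
    """Fraction of questions where the decoded option matches the correct one."""
    hits = sum(decode_answer(q, xy) == q.correct for q, xy in zip(questions, final_positions))
    return hits / len(questions)


# Illustrative data only; these are not items or results from the study.
questions = [
    Question("Which object is a fruit?", {"apple": (0.2, 0.1), "hammer": (0.6, 0.1)}, "apple"),
    Question("Which object can cut paper?", {"scissors": (0.2, 0.4), "sponge": (0.6, 0.4)}, "scissors"),
]
final_positions = [(0.22, 0.12), (0.58, 0.41)]  # where a hypothetical agent ended each episode
score = accuracy(questions, final_positions)
```

Because the answer is read from a coarse choice of target rather than from a precise manipulation, a scheme like this separates "did the model know the answer" from "could it execute a fine-grained motion", which is the kind of confound the summary says the protocol aims to reduce.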

IMPACT This new evaluation method could lead to more accurate assessments of VLA model capabilities, driving improvements in embodied AI and robotics.

RANK_REASON The cluster describes a new research paper introducing an evaluation protocol for VLA models.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

Hugging Face Daily Papers · English (EN) · 2026-06-17 17:20

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet it is unclear how much commonsense and factual knowledge they retain after adaptation. Failures on knowledge-sensitive tasks are ambiguous, conflating…

