Vision-Language-Action (VLA) models are currently the leading architecture for embodied AI because of their strong task generalization. However, VLA models have limitations, particularly in tactile and proprioceptive sensing, which are crucial for dexterous actions such as rotating a basketball. Haozhi Qi, a scientist at Amazon's AI and Robotics Research Lab, suggests that VLA's popularity reflects the maturity of today's visual sensors relative to still-immature tactile hardware. He argues that embodied systems must integrate additional sensory inputs to compensate for the weaker modalities, which makes VLA a strong contender: it leverages vision and language to work around tactile deficiencies.
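As a rough illustration of the multi-modal integration Qi describes, the sketch below extends a VLA-style policy with tactile and proprioceptive encoders alongside vision and language. It is a minimal assumption-laden sketch, not any published architecture: all module names, input dimensions, and the token-fusion scheme are hypothetical stand-ins.

```python
import torch
import torch.nn as nn


class MultiModalPolicy(nn.Module):
    """Hypothetical sketch of a VLA-style policy augmented with tactile and
    proprioceptive inputs. Dimensions and fusion choices are illustrative
    assumptions, not a description of any real system."""

    def __init__(self, action_dim: int = 7, d_model: int = 256):
        super().__init__()
        # Stand-ins for pretrained vision / language backbones (pooled features).
        self.vision_enc = nn.Linear(512, d_model)
        self.language_enc = nn.Linear(384, d_model)
        # Low-dimensional senses that current VLA models typically lack:
        # e.g. a fingertip pressure array and joint positions/velocities.
        self.tactile_enc = nn.Linear(64, d_model)
        self.proprio_enc = nn.Linear(14, d_model)
        # Simple fusion: one token per modality, a small transformer, then an
        # action head over the pooled representation.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, vision, language, tactile, proprio):
        # Encode each modality into a single token of size d_model.
        tokens = torch.stack(
            [
                self.vision_enc(vision),
                self.language_enc(language),
                self.tactile_enc(tactile),
                self.proprio_enc(proprio),
            ],
            dim=1,
        )  # shape: (batch, 4 tokens, d_model)
        fused = self.fusion(tokens)
        # Decode an action from the mean-pooled fused tokens.
        return self.action_head(fused.mean(dim=1))


if __name__ == "__main__":
    policy = MultiModalPolicy()
    action = policy(
        torch.randn(1, 512),  # image features
        torch.randn(1, 384),  # instruction embedding
        torch.randn(1, 64),   # tactile reading
        torch.randn(1, 14),   # proprioceptive state
    )
    print(action.shape)  # torch.Size([1, 7])
```

The point of the sketch is architectural: once the weaker modalities are encoded into the same token space, vision and language features can dominate the fused representation where tactile signals are noisy or unavailable, which is the compensation Qi points to.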
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT VLA's dominance in embodied AI is questioned, highlighting the need for multi-modal sensing beyond vision to overcome current hardware limitations.
RANK_REASON Discusses a current architectural paradigm (VLA) for embodied AI and its limitations, citing a researcher's perspective.