AVP architecture enhances robotic manipulation with visual primitives

By PulseAugur Editorial · [1 source] · 2026-05-26 04:00

Researchers have introduced AVP (Action with Visual Primitives), an architecture for vision-language-action (VLA) models in robotics. It separates instruction comprehension and scene understanding from motor control: a pre-trained vision-language model infers target locations and emits visual-primitive tokens, and those tokens condition a separate action expert. The authors report improved data efficiency and generalization on real-robot pick-and-place tasks.
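The decoupling described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the primitive is taken to be a single 2D point, the token format is a simple grid quantization, and `stub_vlm` and `stub_action_expert` are hypothetical stand-ins for the pre-trained vision-language model and the action expert. The point is only that the action side receives primitive tokens and never the instruction text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

GRID = 32  # assumed number of bins per image axis (not from the paper)


@dataclass
class PointPrimitive:
    """Hypothetical visual primitive: one image location, normalized to [0, 1]."""
    x: float
    y: float


def primitive_to_tokens(p: PointPrimitive, grid: int = GRID) -> Tuple[int, int]:
    """Quantize a normalized point into one discrete token id per axis."""
    tx = min(int(p.x * grid), grid - 1)
    ty = min(int(p.y * grid), grid - 1)
    return tx, ty


def tokens_to_point(tokens: Tuple[int, int], grid: int = GRID) -> Tuple[float, float]:
    """Map token ids back to the center of their grid cell."""
    tx, ty = tokens
    return (tx + 0.5) / grid, (ty + 0.5) / grid


def stub_vlm(instruction: str,
             detections: Dict[str, Tuple[float, float]]) -> List[Tuple[int, int]]:
    """Stand-in for the VLM: find the objects named in the instruction and
    emit primitive tokens for them, in the order they are mentioned."""
    words = instruction.lower().split()
    named = [name for name in detections if name in words]
    named.sort(key=words.index)
    return [primitive_to_tokens(PointPrimitive(*detections[n])) for n in named]


def stub_action_expert(tokens: List[Tuple[int, int]],
                       gripper_xy: Tuple[float, float],
                       max_step: float = 0.05) -> Tuple[float, float]:
    """Stand-in for the action expert: step toward the first target.
    It is conditioned on primitive tokens only, not on the instruction."""
    target = tokens_to_point(tokens[0])

    def clip(v: float) -> float:
        return max(-max_step, min(max_step, v))

    return clip(target[0] - gripper_xy[0]), clip(target[1] - gripper_xy[1])


if __name__ == "__main__":
    detections = {"cup": (0.25, 0.5), "plate": (0.75, 0.5)}  # made-up scene
    tokens = stub_vlm("pick the cup and place it on the plate", detections)
    print(tokens)
    print(stub_action_expert(tokens, gripper_xy=(0.0, 0.5)))
```

In this toy setup, swapping the instruction or the scene changes only what `stub_vlm` emits; the action stub is unchanged, which is the kind of separation the summary attributes to AVP.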

IMPACT AVP architecture improves robotic manipulation success rates and data efficiency by decoupling perception from action.

RANK_REASON The cluster contains a research paper detailing a new architecture for robotic manipulation.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Weilong Guo, Yuchen Wang, Renping Zhou, Yunfeng Zhang, Rui Fang, Yuyang Pang, Wenda Xu, Gao Huang · 2026-05-26 04:00

Action with Visual Primitives

arXiv:2605.22183v2 · Abstract: Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic manipulation. A common design in current architectures maps language instructions and visual observations to actions in a sing…
