Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 5d

Action with Visual Primitives

Researchers have proposed AVP (Action with Visual Primitives), an architecture for vision-language-action (VLA) models in robotics. AVP separates instruction comprehension and scene understanding from motor control: a pre-trained vision-language model infers target locations and emits visual-primitive tokens, and those tokens condition a separate action expert. The design is reported to improve data efficiency and generalization on real-robot pick-and-place tasks.
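To make the two-stage split concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. The names `VisualPrimitive`, `infer_primitives`, `action_expert`, the 100-bin location grid, and the token format are all hypothetical, and the "vision-language model" is replaced by a dictionary lookup over a hand-written scene. The only point it illustrates is the interface: the first stage turns an instruction plus a scene into location tokens, and the second stage produces motions from those tokens alone.

```python
from __future__ import annotations

from dataclasses import dataclass

BINS = 100  # assumed size of the location-token grid (not from the paper)


@dataclass(frozen=True)
class VisualPrimitive:
    """Hypothetical point primitive: a role plus a quantized image location."""
    role: str  # "pick" or "place"
    x_bin: int
    y_bin: int

    def as_token(self) -> str:
        return f"<{self.role}:{self.x_bin},{self.y_bin}>"


def quantize(value: float) -> int:
    """Map a normalized coordinate in [0, 1] to an integer bin in [0, BINS - 1]."""
    return min(BINS - 1, max(0, int(value * BINS)))


def infer_primitives(
    instruction: str, scene: dict[str, tuple[float, float]]
) -> list[VisualPrimitive]:
    """Stage 1, a stand-in for the vision-language model.

    A real system would ground the instruction in camera images. Here the
    scene is a dict of object name -> normalized (x, y); the first object
    mentioned is the pick target and the second is the place target.
    """
    words = instruction.lower().replace(".", "").split()
    mentioned = sorted((words.index(name), name) for name in scene if name in words)
    if len(mentioned) != 2:
        raise ValueError("expected exactly two known objects in the instruction")
    primitives = []
    for role, (_, name) in zip(("pick", "place"), mentioned):
        x, y = scene[name]
        primitives.append(VisualPrimitive(role, quantize(x), quantize(y)))
    return primitives


def action_expert(primitives: list[VisualPrimitive]) -> list[tuple[float, float, str]]:
    """Stage 2, a stand-in for the action expert.

    It never sees the instruction text, only the primitives. Each primitive
    is decoded to its bin centre and paired with a gripper command.
    """
    gripper_for_role = {"pick": "close", "place": "open"}
    actions = []
    for p in primitives:
        x = (p.x_bin + 0.5) / BINS
        y = (p.y_bin + 0.5) / BINS
        actions.append((x, y, gripper_for_role[p.role]))
    return actions


scene = {"cup": (0.25, 0.5), "plate": (0.75, 0.5)}
primitives = infer_primitives("Put the cup on the plate.", scene)
print([p.as_token() for p in primitives])  # ['<pick:25,50>', '<place:75,50>']
for action in action_expert(primitives):
    print(action)
```

Because `action_expert` depends only on the primitive tokens, the language and perception stage could in principle be swapped or retrained without touching the controller, which is the kind of decoupling the brief describes.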

IMPACT · The AVP architecture improves robotic manipulation success rates and data efficiency by decoupling perception from action.

arXiv
Weilong Guo
AVP