Brief · PulseAugur

TOOL · arXiv cs.CV English(EN) · 19h

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

Researchers have developed a framework for training Vision-Language-Action (VLA) models on unlabeled human egocentric videos. The system uses a Hybrid Disentangled VQ-VAE to separate motion dynamics from the background, producing a cross-embodiment action codebook. Pre-training on this codebook lets the VLM backbone learn action intent, and an intent-perception decoupling strategy refines predictions further by separating action intent from state-specific visual features. With minimal downstream adaptation, the method performs competitively with state-of-the-art VLA models trained on extensive annotated datasets.
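The brief gives no implementation detail, so the following is only a minimal sketch of the generic vector-quantization step that any VQ-VAE codebook relies on: a continuous latent is snapped to its nearest codebook entry, and the entry's index serves as a discrete "latent action". It is not the paper's Hybrid Disentangled VQ-VAE. The names `motion_latent` and `quantize`, the toy 2-D features, and the four-entry codebook are all assumptions made for illustration; in a real system both the encoder and the codebook would be learned.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def motion_latent(feat_t: Sequence[float], feat_next: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a learned motion encoder.

    Taking the difference of two frame features cancels whatever stays
    constant between frames (a crude proxy for the static background).
    """
    return [b - a for a, b in zip(feat_t, feat_next)]


def quantize(latent: Sequence[float],
             codebook: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return the index and vector of the codebook entry nearest to `latent`."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(latent, codebook[i]))
    return index, list(codebook[index])


# Toy codebook (assumed): each entry stands for one discrete latent action.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Features of two consecutive frames (assumed values).
z = motion_latent([0.2, 0.5], [1.1, 0.6])
index, code = quantize(z, codebook)
print(index, code)
```

Because the discrete index depends only on the change between frames, the same code can in principle label a human hand motion and a robot arm motion, which is the intuition behind a cross-embodiment codebook.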

IMPACT By leveraging abundant unlabeled human video, this research could make VLA training more efficient and reduce the need for costly annotated robotic datasets.

Vision-Language-Action (VLA) models
Hybrid Disentangled VQ-VAE
human ego-videos