New framework trains VLA models on unlabeled human videos

By PulseAugur Editorial · [1 source] · 2026-06-17 11:37

Researchers have developed a framework for training Vision-Language-Action (VLA) models on unlabeled egocentric human videos. The system uses a Hybrid Disentangled VQ-VAE to separate motion dynamics from backgrounds, producing a cross-embodiment action codebook. This pre-training lets the VLM backbone learn action intent, and an intent-perception decoupling strategy further refines predictions by separating action intent from state-specific visual features. With minimal downstream adaptation, the method performs competitively with state-of-the-art VLA models trained on extensive annotated datasets.
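
For readers unfamiliar with VQ-VAEs, the "codebook" idea can be illustrated with a minimal sketch of vector quantization: each continuous latent vector is replaced by the index of its nearest codebook entry, which yields discrete tokens a language-model backbone can predict. This is a generic illustration, not the paper's Hybrid Disentangled VQ-VAE; the codebook values, the 2-D latents, and the function names below are hypothetical.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent: Sequence[float],
             codebook: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return (index, code vector) of the codebook entry nearest to `latent`."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(latent, codebook[i]))
    return index, list(codebook[index])


# Hypothetical codebook of three 2-D "latent action" codes (illustrative only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Hypothetical motion latents, standing in for encoder outputs.
motion_latents = [[0.9, 0.1], [0.1, 0.8], [0.05, -0.02]]

# Each latent becomes a discrete token: the index of its nearest code.
tokens = [quantize(z, codebook)[0] for z in motion_latents]
print(tokens)
```

In a trained VQ-VAE the codebook entries are learned jointly with the encoder and decoder; the sketch covers only the nearest-neighbor lookup step.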

IMPACT This research could enable more efficient training of VLA models by leveraging abundant unlabeled human video data, potentially reducing the need for costly annotated robotic datasets.

RANK_REASON The cluster contains an academic paper detailing a new method for training AI models.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CV TIER_1 English(EN) · Jincheng Yu · 2026-06-17 11:37

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

Training generalist Vision-Language-Action (VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels…
