VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training
Researchers have developed VISTA, a framework for improving the training of Vision-Language-Action (VLA) models on real-world manipulation data collected with the Universal Manipulation Interface (UMI). The framework addresses challenges in this data such as distorted camera views and human-collected trajectories that a robot cannot physically execute. VISTA adds a new dataset, UMI-VQA, built around these distorted visual inputs, and a validation pipeline that filters out unsafe or physically impossible robot actions. Together, these components lead to better policy performance.
IMPACT: Enhances robot learning by enabling more robust training on real-world data, potentially improving success rates when policies are deployed.
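The summary does not say how VISTA's validation pipeline decides that an action is unsafe or impossible. A purely illustrative sketch of the general idea follows. It assumes each trajectory is a list of end-effector positions in meters, sampled at a fixed interval `dt`, and it treats a trajectory as infeasible if any waypoint leaves a reachable workspace box or if any step implies a speed above a limit. `Workspace`, `is_feasible`, `filter_trajectories`, and every limit value are hypothetical. VISTA's actual checks may differ substantially.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Workspace:
    # Hypothetical axis-aligned box the robot's end effector can reach, in meters.
    lo: tuple[float, float, float]
    hi: tuple[float, float, float]

    def contains(self, p: tuple[float, float, float]) -> bool:
        return all(l <= c <= h for l, c, h in zip(self.lo, p, self.hi))


def is_feasible(trajectory, dt: float, max_speed: float, workspace: Workspace) -> bool:
    """Hypothetical check: every waypoint is reachable and no step exceeds max_speed (m/s)."""
    if not all(workspace.contains(p) for p in trajectory):
        return False
    # Speed between consecutive waypoints = Euclidean distance / sampling interval.
    for a, b in zip(trajectory, trajectory[1:]):
        if math.dist(a, b) / dt > max_speed:
            return False
    return True


def filter_trajectories(trajectories, dt: float, max_speed: float, workspace: Workspace):
    """Keep only the trajectories that pass the hypothetical feasibility check."""
    return [t for t in trajectories if is_feasible(t, dt, max_speed, workspace)]


if __name__ == "__main__":
    ws = Workspace(lo=(-0.5, -0.5, 0.0), hi=(0.5, 0.5, 0.6))
    smooth = [(0.0, 0.0, 0.2), (0.01, 0.0, 0.2), (0.02, 0.0, 0.2)]  # 0.1 m/s
    jump = [(0.0, 0.0, 0.2), (0.4, 0.0, 0.2)]  # 4 m/s, too fast
    outside = [(0.0, 0.0, 0.7)]  # above the workspace ceiling
    kept = filter_trajectories([smooth, jump, outside], dt=0.1, max_speed=1.0, workspace=ws)
    print(len(kept))
```

A real robot pipeline would more likely check joint limits, torques, and collisions through the robot's kinematic model; the box and speed limit here stand in for those checks only to show the filter-then-train structure.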