PulseAugur

New VLA model compresses frames to single tokens, boosting performance

Researchers have developed a new approach called OneWM-VLA for vision-language-action (VLA) models, which optimizes how visual information is processed for long-horizon planning. The method compresses each frame into a single semantic token, sharply reducing visual bandwidth without sacrificing performance. Trained with relatively few parameters on a 2B backbone, OneWM-VLA demonstrates substantial gains in success rates across multiple challenging benchmarks, including MetaWorld MT50 and LIBERO-Long, and shows promise on real-world robotic tasks.
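The core idea — one semantic token per frame — can be sketched as pooling each frame's patch embeddings into a single vector before feeding them to the world module. The pooling choice below (mean pooling) is an illustrative assumption; the summary does not specify OneWM-VLA's actual compressor.

```python
import numpy as np

def frame_to_token(patch_embeddings: np.ndarray) -> np.ndarray:
    """Compress one frame's patch embeddings, shape (n_patches, d),
    into a single d-dimensional semantic token.

    Mean pooling is a stand-in here; the paper's compressor may be
    a learned module (e.g. attention pooling) instead.
    """
    return patch_embeddings.mean(axis=0)

# A 16-frame clip, 196 ViT patches per frame, 768-dim features
# (all sizes are hypothetical, for illustration only).
clip = np.random.randn(16, 196, 768)
tokens = np.stack([frame_to_token(frame) for frame in clip])
print(tokens.shape)  # (16, 768): one token per frame
```

Compared with passing all 196 patch tokens per frame to the planner, this reduces the visual token count for a 16-frame horizon from 3,136 to 16, which is the bandwidth saving the paper's title refers to.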

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT This research could lead to more efficient and capable vision-language-action models for robotics and long-horizon planning tasks.

RANK_REASON The cluster contains a new academic paper detailing a novel model architecture and its performance improvements on benchmarks.

Read on arXiv cs.AI →

COVERAGE [1]

  1. arXiv cs.AI TIER_1 · Bin Liu

    One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame v…