PulseAugur

New VLA model compresses frames to single tokens, boosting performance

Researchers have proposed OneWM-VLA, an approach for vision-language-action (VLA) models that changes how visual information is processed for long-horizon planning. The method compresses each frame into a single semantic token, significantly reducing visual bandwidth without sacrificing performance. Trained with a relatively small number of parameters on a 2B backbone, OneWM-VLA substantially improves success rates on multiple challenging benchmarks, including MetaWorld MT50 and LIBERO-Long, and shows promise on real-world robotic tasks.

Impact: This research could lead to more efficient and capable vision-language-action models for robotics and long-horizon planning tasks.

Ranking rationale: The cluster contains a new academic paper detailing a novel model architecture and its performance improvements on benchmarks.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

  1. arXiv cs.AI · TIER_1 · English (EN) · Bin Liu

    One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame v…