PulseAugur

New frameworks and a benchmark accelerate interactive video world models

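Researchers have developed new methods for accelerating interactive video world models, which generate video content in response to user-controlled camera movements. "Light Interaction" is a training-free approach that adaptively manages context and uses a denoising cache, achieving speedups of up to 2.59x. Separately, the "minWM" framework provides an open-source pipeline for converting existing video diffusion models into real-time interactive world models. A new benchmark, "WBench", has also been introduced to evaluate these interactive video world models comprehensively across five dimensions.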

Impact: Advances in interactive video generation and world models could enable more realistic simulation and embodied AI training.

Ranking rationale: Multiple research papers introduce new methods, frameworks, and benchmarks for interactive video world models.


AI-generated summary · Google Gemini · from 15 sources.

Sources [15]

  1. arXiv cs.AI TIER_1 English(EN) · Jiayi Zhu, Jianing Zhang, Yiying Yang, Wei Cheng, Xiaoyun Yuan ·

    ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

    arXiv:2603.02697v2 Announce Type: replace-cross Abstract: This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interac…

  2. arXiv cs.AI TIER_1 English(EN) · Teng Hu, Mingchun Lu, Yating Wang, Jiangning Zhang, Jinkun Hao, Ye Pan, Ran Yi, Lizhuang Ma, Dacheng Tao ·

    MetaWorld: Scaling Multi-Agent Video World Models from Single-View Video Data

    arXiv:2606.02753v1 Announce Type: cross Abstract: Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limited to a single agent observing from a single perspective. Extending these models to multi-a…

  3. arXiv cs.LG TIER_1 English(EN) · Jiacheng Lu, Haoyi Zhu, Sipei Yi, Enze Xie, Yu Li, Cheng Zhuo ·

    Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

    arXiv:2605.31158v1 Announce Type: cross Abstract: Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, sc…

  4. arXiv cs.AI TIER_1 English(EN) · Taiye Chen, Xun Hu, Zihan Ding, Chi Jin ·

    VRAG: Learning World Models for Interactive Video Generation

    arXiv:2505.21996v4 Announce Type: replace-cross Abstract: Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world mode…

  5. Hugging Face Daily Papers TIER_1 English(EN) ·

    Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

    Light Interaction accelerates interactive video world models through adaptive computation strategies and optimized attention mechanisms without requiring model retraining.

  6. Hugging Face Daily Papers TIER_1 English(EN) ·

    minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

    A comprehensive framework is presented for converting bidirectional video diffusion models into real-time interactive world models with controllable, causal, and low-latency capabilities through fine-tuning and distillation techniques.

  7. arXiv cs.AI TIER_1 English(EN) · Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou ·

    Olaf-World: Orienting Latent Actions for Video World Modeling

    arXiv:2602.10104v2 Announce Type: replace-cross Abstract: Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across …

  8. Hugging Face Daily Papers TIER_1 English(EN) ·

    WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

    WBench presents a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns with diverse scenarios and interaction types.

  9. arXiv cs.CV TIER_1 English(EN) · Jiuming Liu, Chaojun Ni, Mengmeng Liu, Chensheng Peng, Fangjinhua Wang, Sitian Shen, Marc Pollefeys, Masayoshi Tomizuka, Ayush Tewari, Per Ola Kristensson ·

    Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends

    arXiv:2606.01164v1 Announce Type: new Abstract: With rapid development of large language models and diffusion-based content generation, world modeling has attracted increasing research attention, benefiting various downstream domains such as game engines, embodied AI, autonomous …

  10. arXiv cs.CV TIER_1 English(EN) · Cheng Zhuo ·

    Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

    Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibit…

  11. arXiv cs.CV TIER_1 English(EN) · Min Zhao, Hongzhou Zhu, Bokai Yan, Zihan Zhou, Yimin Chen, Wenqiang Sun, Kaiwen Zheng, Guande He, Xiao Yang, Chongxuan Li, Fan Bao, Jun Zhu ·

    minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

    arXiv:2605.30263v1 Announce Type: new Abstract: Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require contro…

  12. arXiv cs.CV TIER_1 English(EN) · Jun Zhu ·

    minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

    Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, and low-latency rollout, which i…

  13. arXiv cs.CV TIER_1 English(EN) · Kaining Ying, Hengrui Hu, Siyu Ren, Jiamu Li, Fengjiao Chen, Ziwen Wang, Xuezhi Cao, Xunliang Cai, Henghui Ding ·

    WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

    arXiv:2605.25874v1 Announce Type: new Abstract: Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive mult…

  14. arXiv cs.CV TIER_1 English(EN) · Bohai Gu, Taiyi Wu, Yueyang Yuan, Jian Liu, Xiaocheng Lu, Dazhao Du, Jie Zhang, Jinxiang Lai, Shuai Yang, Xiaotong Zhao, Alan Zhao, Song Guo ·

    WorldCraft: From Camera Navigation to Object Manipulation in Interactive Video World Models

    arXiv:2605.25077v1 Announce Type: new Abstract: Recent video-based world models have made pixel-space environments interactive at the camera level: users can navigate viewpoints while the model generates coherent visual continuations. Yet their action spaces remain incomplete: us…

  15. arXiv cs.CV TIER_1 English(EN) · Henghui Ding ·

    WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

    Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model eva…