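Researchers have developed new methods for accelerating interactive video world models, which generate video content in response to a user's camera movements. "Light Interaction" is a training-free approach that adaptively manages context and uses denoising caching, achieving up to a 2.59x speedup. Separately, the "minWM" framework provides an open-source pipeline for converting existing video diffusion models into real-time interactive world models. In addition, a new benchmark called "WBench" has been introduced to evaluate these interactive video world models comprehensively across multiple dimensions.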
arXiv:2603.02697v2 Announce Type: replace-cross Abstract: This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interac…
arXiv cs.AI
Teng Hu, Mingchun Lu, Yating Wang, Jiangning Zhang, Jinkun Hao, Ye Pan, Ran Yi, Lizhuang Ma, Dacheng Tao
arXiv:2606.02753v1 Announce Type: cross Abstract: Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limited to a single agent observing from a single perspective. Extending these models to multi-a…
arXiv:2605.31158v1 Announce Type: cross Abstract: Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibit…
arXiv cs.AI
Taiye Chen, Xun Hu, Zihan Ding, Chi Jin
arXiv:2505.21996v4 Announce Type: replace-cross Abstract: Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world mode…
Light Interaction accelerates interactive video world models through adaptive computation strategies and optimized attention mechanisms without requiring model retraining.
A comprehensive framework is presented for converting bidirectional video diffusion models into real-time interactive world models with controllable, causal, and low-latency capabilities through fine-tuning and distillation techniques.
arXiv cs.AI
Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou
arXiv:2602.10104v2 Announce Type: replace-cross Abstract: Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across …
WBench presents a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns with diverse scenarios and interaction types.
arXiv cs.CV
Jiuming Liu, Chaojun Ni, Mengmeng Liu, Chensheng Peng, Fangjinhua Wang, Sitian Shen, Marc Pollefeys, Masayoshi Tomizuka, Ayush Tewari, Per Ola Kristensson
arXiv:2606.01164v1 Announce Type: new Abstract: With rapid development of large language models and diffusion-based content generation, world modeling has attracted increasing research attention, benefiting various downstream domains such as game engines, embodied AI, autonomous …
arXiv:2605.30263v1 Announce Type: new Abstract: Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, and low-latency rollout, which i…
arXiv:2605.25874v1 Announce Type: new Abstract: Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model eva…
arXiv cs.CV
Bohai Gu, Taiyi Wu, Yueyang Yuan, Jian Liu, Xiaocheng Lu, Dazhao Du, Jie Zhang, Jinxiang Lai, Shuai Yang, Xiaotong Zhao, Alan Zhao, Song Guo
arXiv:2605.25077v1 Announce Type: new Abstract: Recent video-based world models have made pixel-space environments interactive at the camera level: users can navigate viewpoints while the model generates coherent visual continuations. Yet their action spaces remain incomplete: us…