New frameworks and benchmarks accelerate interactive video world models
By PulseAugur Editorial · [15 sources]
Researchers have developed new methods to accelerate interactive video world models, which generate video content based on user camera movements. "Light Interaction" offers a training-free approach by adaptively managing context and using a denoising cache, achieving up to 2.59x speedup. Separately, the "minWM" framework provides an open-source pipeline for converting existing video diffusion models into real-time interactive world models. Additionally, a new benchmark called "WBench" has been introduced to comprehensively evaluate these interactive video world models across various dimensions.
AI
IMPACT
Advances in interactive video generation and world modeling could enable more realistic simulations and embodied AI training.
RANK_REASON
Multiple research papers introducing new methods, frameworks, and benchmarks for interactive video world models.
arXiv:2603.02697v2 Announce Type: replace-cross Abstract: This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interac…
arXiv cs.AI
TIER_1 · English (EN) · Teng Hu, Mingchun Lu, Yating Wang, Jiangning Zhang, Jinkun Hao, Ye Pan, Ran Yi, Lizhuang Ma, Dacheng Tao
arXiv:2606.02753v1 Announce Type: cross Abstract: Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limited to a single agent observing from a single perspective. Extending these models to multi-a…
arXiv:2605.31158v1 Announce Type: cross Abstract: Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, sc…
arXiv cs.AI
TIER_1 · English (EN) · Taiye Chen, Xun Hu, Zihan Ding, Chi Jin
arXiv:2505.21996v4 Announce Type: replace-cross Abstract: Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world mode…
Light Interaction accelerates interactive video world models through adaptive computation strategies and optimized attention mechanisms without requiring model retraining.
A comprehensive framework is presented for converting bidirectional video diffusion models into real-time interactive world models with controllable, causal, and low-latency capabilities through fine-tuning and distillation techniques.
arXiv cs.AI
TIER_1 · English (EN) · Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou
arXiv:2602.10104v2 Announce Type: replace-cross Abstract: Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across …
WBench presents a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns with diverse scenarios and interaction types.
arXiv cs.CV
TIER_1 · English (EN) · Jiuming Liu, Chaojun Ni, Mengmeng Liu, Chensheng Peng, Fangjinhua Wang, Sitian Shen, Marc Pollefeys, Masayoshi Tomizuka, Ayush Tewari, Per Ola Kristensson
arXiv:2606.01164v1 Announce Type: new Abstract: With rapid development of large language models and diffusion-based content generation, world modeling has attracted increasing research attention, benefiting various downstream domains such as game engines, embodied AI, autonomous …
Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibit…
arXiv:2605.30263v1 Announce Type: new Abstract: Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require contro…
Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, and low-latency rollout, which i…
arXiv:2605.25874v1 Announce Type: new Abstract: Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive mult…
arXiv cs.CV
TIER_1 · English (EN) · Bohai Gu, Taiyi Wu, Yueyang Yuan, Jian Liu, Xiaocheng Lu, Dazhao Du, Jie Zhang, Jinxiang Lai, Shuai Yang, Xiaotong Zhao, Alan Zhao, Song Guo
arXiv:2605.25077v1 Announce Type: new Abstract: Recent video-based world models have made pixel-space environments interactive at the camera level: users can navigate viewpoints while the model generates coherent visual continuations. Yet their action spaces remain incomplete: us…
Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model eva…
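The headline figures quoted in this roundup (a speedup of up to 2.59x for Light Interaction, and 289 test cases with 1,058 interaction turns for WBench) can be put in more concrete terms with a little arithmetic. The Python sketch below is illustrative only: the function names are hypothetical and do not come from any of the papers, and the inputs are just the numbers reported above. Note that 2.59x is a best-case figure, and the mean turns per case says nothing about how turns are distributed across cases.

```python
def latency_fraction(speedup: float) -> float:
    """Fraction of the original runtime that remains after a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def mean_turns_per_case(total_turns: int, total_cases: int) -> float:
    """Average number of interaction turns per benchmark test case."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    return total_turns / total_cases


# Figures as reported in the summaries above.
LIGHT_INTERACTION_MAX_SPEEDUP = 2.59  # "up to" value, i.e. best case
WBENCH_TEST_CASES = 289
WBENCH_INTERACTION_TURNS = 1058

remaining = latency_fraction(LIGHT_INTERACTION_MAX_SPEEDUP)
print(f"Best-case runtime: {remaining:.1%} of baseline "
      f"({1 - remaining:.1%} reduction)")

avg_turns = mean_turns_per_case(WBENCH_INTERACTION_TURNS, WBENCH_TEST_CASES)
print(f"WBench mean turns per test case: {avg_turns:.2f}")
```

Read this way, the best-case speedup corresponds to roughly 39% of the baseline runtime, and WBench averages between three and four interaction turns per test case.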