PulseAugur

New frameworks and benchmarks accelerate interactive video world models

Researchers have released new methods for accelerating interactive video world models, which generate video chunk by chunk in response to user-controlled camera movements. "Light Interaction" is a training-free approach that adaptively manages context and uses a denoising cache, achieving up to a 2.59x speedup. Separately, the "minWM" framework provides an open-source pipeline for converting existing bidirectional video diffusion models into real-time interactive world models through fine-tuning and distillation. A new benchmark, "WBench", evaluates interactive world models across five dimensions using 289 test cases and 1,058 interaction turns.

IMPACT Advances in interactive video generation and world modeling could enable more realistic simulations and embodied AI training.

RANK_REASON Multiple research papers introducing new methods, frameworks, and benchmarks for interactive video world models.


AI-generated summary · Google Gemini · from 15 sources.

COVERAGE [15]

  1. arXiv cs.AI TIER_1 English(EN) · Jiayi Zhu, Jianing Zhang, Yiying Yang, Wei Cheng, Xiaoyun Yuan ·

    ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

    arXiv:2603.02697v2 Announce Type: replace-cross Abstract: This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interac…

  2. arXiv cs.AI TIER_1 English(EN) · Teng Hu, Mingchun Lu, Yating Wang, Jiangning Zhang, Jinkun Hao, Ye Pan, Ran Yi, Lizhuang Ma, Dacheng Tao ·

    MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data

    arXiv:2606.02753v1 Announce Type: cross Abstract: Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limited to a single agent observing from a single perspective. Extending these models to multi-a…

  3. arXiv cs.LG TIER_1 English(EN) · Jiacheng Lu, Haoyi Zhu, Sipei Yi, Enze Xie, Yu Li, Cheng Zhuo ·

    Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

    arXiv:2605.31158v1 Announce Type: cross Abstract: Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, sc…

  4. arXiv cs.AI TIER_1 English(EN) · Taiye Chen, Xun Hu, Zihan Ding, Chi Jin ·

    VRAG: Learning World Models for Interactive Video Generation

    arXiv:2505.21996v4 Announce Type: replace-cross Abstract: Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world mode…

  5. Hugging Face Daily Papers TIER_1 English(EN) ·

    Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

    Light Interaction accelerates interactive video world models through adaptive computation strategies and optimized attention mechanisms without requiring model retraining.

  6. Hugging Face Daily Papers TIER_1 English(EN) ·

    minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

    A comprehensive framework is presented for converting bidirectional video diffusion models into real-time interactive world models with controllable, causal, and low-latency capabilities through fine-tuning and distillation techniques.

  7. arXiv cs.AI TIER_1 English(EN) · Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou ·

    Olaf-World: Orienting Latent Actions for Video World Modeling

    arXiv:2602.10104v2 Announce Type: replace-cross Abstract: Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across …

  8. Hugging Face Daily Papers TIER_1 English(EN) ·

    WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

    WBench presents a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns with diverse scenarios and interaction types.

  9. arXiv cs.CV TIER_1 English(EN) · Jiuming Liu, Chaojun Ni, Mengmeng Liu, Chensheng Peng, Fangjinhua Wang, Sitian Shen, Marc Pollefeys, Masayoshi Tomizuka, Ayush Tewari, Per Ola Kristensson ·

    Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends

    arXiv:2606.01164v1 Announce Type: new Abstract: With rapid development of large language models and diffusion-based content generation, world modeling has attracted increasing research attention, benefiting various downstream domains such as game engines, embodied AI, autonomous …

  10. arXiv cs.CV TIER_1 English(EN) · Cheng Zhuo ·

    Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

    Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibit…

  11. arXiv cs.CV TIER_1 English(EN) · Min Zhao, Hongzhou Zhu, Bokai Yan, Zihan Zhou, Yimin Chen, Wenqiang Sun, Kaiwen Zheng, Guande He, Xiao Yang, Chongxuan Li, Fan Bao, Jun Zhu ·

    minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

    arXiv:2605.30263v1 Announce Type: new Abstract: Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require contro…

  12. arXiv cs.CV TIER_1 English(EN) · Jun Zhu ·

    minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

    Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, and low-latency rollout, which i…

  13. arXiv cs.CV TIER_1 English(EN) · Kaining Ying, Hengrui Hu, Siyu Ren, Jiamu Li, Fengjiao Chen, Ziwen Wang, Xuezhi Cao, Xunliang Cai, Henghui Ding ·

    WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

    arXiv:2605.25874v1 Announce Type: new Abstract: Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive mult…

  14. arXiv cs.CV TIER_1 English(EN) · Bohai Gu, Taiyi Wu, Yueyang Yuan, Jian Liu, Xiaocheng Lu, Dazhao Du, Jie Zhang, Jinxiang Lai, Shuai Yang, Xiaotong Zhao, Alan Zhao, Song Guo ·

    WorldCraft: From Camera Navigation to Object Manipulation in Interactive Video World Models

    arXiv:2605.25077v1 Announce Type: new Abstract: Recent video-based world models have made pixel-space environments interactive at the camera level: users can navigate viewpoints while the model generates coherent visual continuations. Yet their action spaces remain incomplete: us…

  15. arXiv cs.CV TIER_1 English(EN) · Henghui Ding ·

    WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

    Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model eva…