Incantation uses natural language for multi-entity video control

By PulseAugur Editorial · [1 source] · 2026-05-18 16:12

Researchers have introduced Incantation, an interactive video world model that uses natural language as its primary action interface. This approach gives fine-grained control over multiple entities within video simulations and enables cross-entity generalization, addressing limitations of earlier control protocols. The model shows significant improvements over existing baselines in handling out-of-vocabulary prompts and cross-entity transfer, and it runs in real time.

IMPACT Enables more intuitive and flexible control over complex simulated environments, potentially advancing AI-driven content creation and interactive simulations.

RANK_REASON The cluster contains a new academic paper detailing a novel model and its capabilities.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CV TIER_1 English(EN) · Ruili Feng · 2026-05-18 16:12

Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models

Modern interactive video world models have achieved impressive visual fidelity, yet lack fine-grained multi-entity control and cross-entity, cross-world generalization. We trace this gap to the action interface: standard control protocols (e.g. animation IDs, device inputs, scene…
