Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning
Researchers have developed Video-o3, a framework for long-video understanding that iteratively seeks out relevant visual clues and inspects key segments at fine granularity. It addresses the difficulty of tool invocation in multimodal models with Task-Decoupled Attention Masking, which separates reasoning from tool calling while preserving the global context. To manage context length and improve efficiency, it employs a Verifiable Trajectory-Guided Reward mechanism. The framework is supported by a data synthesis pipeline that produced Seeker-173K, a dataset of 173,000 tool-interaction trajectories, contributing to significant performance gains on benchmarks such as MLVU and Video-Holmes.
AI IMPACT: Introduces a novel framework for long-video understanding, potentially improving AI's ability to process and reason over extensive video content.