Brief · PulseAugur

TOOL · arXiv cs.CV · 4d

Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning

Researchers have developed Video-o3, a framework for long-video understanding that iteratively seeks out relevant visual clues and inspects key segments in fine detail. To handle tool invocation in multimodal models, it uses Task-Decoupled Attention Masking, which separates reasoning from tool-calling while preserving global context. To manage context length and improve efficiency, it employs a Verifiable Trajectory-Guided Reward mechanism. The framework is supported by a data synthesis pipeline that produced Seeker-173K, a dataset of 173,000 tool-interaction trajectories. Video-o3 is reported to achieve significant performance gains on benchmarks such as MLVU and Video-Holmes.

IMPACT Introduces a framework for long-video understanding that could improve AI systems' ability to process and reason over extended video content.

MLVU
Video-Holmes
Video-o3
Xiangyu Zeng
Seeker-173K