Native Active Perception as Reasoning for Omni-Modal Understanding
Researchers have introduced OmniAgent, an omni-modal agent for video understanding that runs an iterative Observation-Thought-Action cycle formulated as a Partially Observable Markov Decision Process (POMDP). The agent selectively distills audio-visual cues into a textual memory, which decouples reasoning complexity from raw video duration and improves computational efficiency. The paper details two training methodologies: Agentic Supervised Fine-Tuning, which bootstraps active perception, and Agentic Reinforcement Learning with TAURA, which optimizes credit assignment. OmniAgent reports state-of-the-art performance on benchmarks such as LVBench, outperforming larger models such as Qwen2.5-VL-72B.
IMPACT: Introduces a more efficient approach to video understanding by selectively processing information, potentially reducing computational costs for long-form content analysis.
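
The Observation-Thought-Action cycle described in the summary can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the paper's implementation: `ToyVideo`, `ToyAgent`, `run_episode`, the keyword-matching "thought" step, and the left-to-right segment policy are all hypothetical stand-ins. It shows one property the summary highlights: the agent reasons over a short textual memory of the segments it chose to inspect, so the memory can stay small even when the video is long.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ToyVideo:
    """Hypothetical stand-in for a long video: one text cue per segment."""
    cues: List[str]

    def observe(self, segment: int) -> str:
        # A real agent would run audio/visual perception here;
        # this toy version just returns a precomputed text cue.
        return self.cues[segment]


@dataclass
class ToyAgent:
    """Hypothetical agent that searches for a keyword in its textual memory."""
    keyword: str
    memory: List[str] = field(default_factory=list)

    def think(self) -> bool:
        # Thought: does the distilled memory already answer the question?
        return any(self.keyword in note for note in self.memory)

    def act(self, n_segments: int, visited: Set[int]) -> Optional[int]:
        # Action: choose the next segment to inspect.
        # Placeholder policy (left to right); a trained policy would go here.
        for segment in range(n_segments):
            if segment not in visited:
                return segment
        return None


def run_episode(video: ToyVideo, agent: ToyAgent, max_steps: int) -> List[str]:
    """Alternate thought, action, and observation until done or out of budget."""
    visited: Set[int] = set()
    for _ in range(max_steps):
        if agent.think():
            break  # enough evidence gathered; stop perceiving
        segment = agent.act(len(video.cues), visited)
        if segment is None:
            break  # nothing left to inspect
        visited.add(segment)
        # Observation: distill the inspected segment into a text note.
        agent.memory.append(f"segment {segment}: {video.observe(segment)}")
    return agent.memory


if __name__ == "__main__":
    video = ToyVideo(cues=["a dog barks", "a door slams", "a phone rings", "credits roll"])
    agent = ToyAgent(keyword="door")
    for note in run_episode(video, agent, max_steps=10):
        print(note)
```

In this toy run the agent stops after the second segment, so the memory holds two notes although the video has four segments. The stopping rule and the segment-selection policy are the parts that a trained agent would replace.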