Brief

last 24h

[2/2] 222 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.
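As a rough illustration of how a 0–100 score could combine those four components, here is a minimal sketch. The feed does not publish its formula, so the weights, the 24-hour half-life, and every name below (`WEIGHTS`, `HALF_LIFE_HOURS`, `Story`, `recency`, `score`) are assumptions made for this example only.

```python
from dataclasses import dataclass

# Hypothetical weights: the feed does not state its actual weighting.
WEIGHTS = {"authority": 0.35, "cluster": 0.30, "headline": 0.15, "recency": 0.20}
HALF_LIFE_HOURS = 24.0  # assumed half-life for the time-decay component


@dataclass
class Story:
    authority: float         # 0..1, trust in the publishing source
    cluster_strength: float  # 0..1, how widely the story is covered
    headline_signal: float   # 0..1, signal extracted from the headline
    age_hours: float         # hours since first publication


def recency(age_hours: float) -> float:
    """Exponential time decay: 1.0 when fresh, 0.5 after one half-life."""
    return 0.5 ** (max(age_hours, 0.0) / HALF_LIFE_HOURS)


def score(story: Story) -> float:
    """Weighted sum of the four components, scaled to the 0-100 range."""
    parts = {
        "authority": story.authority,
        "cluster": story.cluster_strength,
        "headline": story.headline_signal,
        "recency": recency(story.age_hours),
    }
    # Clamp each component to [0, 1] so the total stays within 0-100.
    total = sum(WEIGHTS[name] * min(max(value, 0.0), 1.0)
                for name, value in parts.items())
    return round(100.0 * total, 1)
```

With these assumed weights, a story that maxes out every component and was just published scores 100, and the same story loses recency credit as it ages.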

RESEARCH · arXiv cs.CV English(EN) · 1w · [5 sources]

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

Researchers have introduced several frameworks and benchmarks for video understanding and editing. Aurora is an agentic framework in which a tool-augmented vision-language model interprets raw user editing requests and maps them to structured edit plans for diffusion transformers. OmniPro is a benchmark for omni-proactive streaming video understanding: it evaluates whether models can autonomously decide when and what to say from audio-visual streams, with a focus on the role of audio and on long-horizon robustness. R3-Streaming is a framework for efficient streaming video understanding that dynamically compresses memory and routes computation according to query complexity; it reports state-of-the-art results with a significant reduction in tokens. VideoSeeker introduces a paradigm for instance-level video understanding based on visual prompts and agentic tool invocation, and is reported to outperform models such as GPT-4o and Gemini-2.5-Pro on specific tasks.

IMPACT Taken together, this work points toward video editing tools that act on free-form user requests and toward streaming models that handle live audio-visual content more robustly.
- GPT-4o
- Gemini-2.5-Pro
- VideoSeeker
- R3-Streaming
- OmniPro
- Aurora
RESEARCH · Hugging Face Daily Papers English(EN) · 1w · [4 sources]

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

Researchers have introduced OmniPro and VideoOdyssey, two benchmarks for evaluating how well omni-modal large language models understand long and complex video. OmniPro focuses on proactive streaming video understanding: it assesses a model's ability to decide when and what to say from audio-visual streams, and includes 2,700 human-verified samples across various tasks. VideoOdyssey targets ultra-long-context video understanding, using videos that average 109 minutes to evaluate continuous reasoning and memory retention over extended periods. Both benchmarks expose current limitations in long-horizon robustness, audio utilization, and fine-grained perception, particularly for non-speech audio.

IMPACT These benchmarks could guide the development of models that understand complex, long-form video, a capability relevant to applications such as surveillance, content analysis, and autonomous systems.
