PulseAugur

New benchmarks assess omni-modal LLMs on long video understanding

Researchers have introduced OmniPro and VideoOdyssey, two benchmarks for evaluating how well omni-modal large language models understand long, complex video. OmniPro targets proactive streaming video understanding: it assesses whether a model can decide when to speak and what to say while consuming an audio-visual stream, and it includes 2,700 human-verified samples across various tasks. VideoOdyssey targets ultra-long-context video understanding, using videos that average 109 minutes to evaluate continuous reasoning and memory retention over extended periods. Both benchmarks highlight current limitations in models' long-horizon robustness, audio utilization, and fine-grained perception, particularly with non-speech audio.

IMPACT These benchmarks could help drive the development of AI models that understand complex, long-form video content, a capability that matters for applications such as surveillance, content analysis, and autonomous systems.

RANK_REASON Two new research papers introduce benchmarks for evaluating AI models.


AI-generated summary · Google Gemini · from 4 sources.

COVERAGE [4]

  1. Hugging Face Daily Papers TIER_1 English(EN) ·

    OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

    OmniPro is introduced as the first benchmark for evaluating omni-modal large language models' proactive streaming video understanding, featuring diverse tasks and dual-mode evaluation protocols.

  2. arXiv cs.CV TIER_1 English(EN) · Ming Xie, Zizheng Huang, Xudong Tan, Chao Wang, Xiangyu Zeng, Wenxiao Wu, Tao Chen, Limin Wang, Yanwei Fu ·

    StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering

    arXiv:2605.25621v1. Abstract: While streaming omni-video understanding demands continuous perception and proactive, real-time interaction, this crucial area remains largely under-explored. Current omni-modal methods are inherently designed for offline settings, …

  3. arXiv cs.CV TIER_1 English(EN) · Yanwei Fu ·

    StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering

    While streaming omni-video understanding demands continuous perception and proactive, real-time interaction, this crucial area remains largely under-explored. Current omni-modal methods are inherently designed for offline settings, limiting their applicability in streaming scenar…

  4. arXiv cs.CV TIER_1 English(EN) · Haichen He, Jiayi Zhou, Sifeng Shang, Yihan Hu, Yuanhan Zhang, Kaiyang Zhou ·

    VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

    arXiv:2605.22907v1. Abstract: Real-world long video understanding requires models to perform continuous tracking, information integration and memory retention over massive temporal spans within extreme video durations. Mastering this intense cognitive load const…