PulseAugur

New benchmarks assess omni-modal LLMs on long video understanding

Researchers have introduced OmniPro and VideoOdyssey, two benchmarks for evaluating how well omni-modal large language models understand long, complex video. OmniPro targets proactive streaming video understanding: it assesses whether a model can decide when to speak and what to say from an audio-visual stream, and it includes 2,700 human-verified samples across a range of tasks. VideoOdyssey targets ultra-long-context video understanding, using videos that average 109 minutes to test continuous reasoning and memory retention over extended periods. Both benchmarks expose current model limitations in long-horizon robustness, audio utilization, and fine-grained perception, particularly for non-speech audio.

IMPACT These benchmarks could guide the development of AI models that understand complex, long-form video, a capability relevant to applications such as surveillance, content analysis, and autonomous systems.

RANK_REASON Two new research papers introduce benchmarks for evaluating AI models.

AI-generated summary · Google Gemini · from 5 sources.

COVERAGE [5]

  1. Hugging Face Daily Papers TIER_1 English(EN) ·

    OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

    OmniPro is introduced as the first benchmark for evaluating omni-modal large language models' proactive streaming video understanding, featuring diverse tasks and dual-mode evaluation protocols.

  2. arXiv cs.CV TIER_1 English(EN) · Peiran Wu, Yunze Liu, Chi-Hao Wu, Chen Chen, Junxiao Shen ·

    O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding

    arXiv:2605.26584v1. Abstract: Omnimodal large language models enable unified audio video understanding, but long joint token sequences make inference costly, and existing benchmarks do not fully isolate audio visual association in noisy user generated videos. We…

  3. arXiv cs.CV TIER_1 English(EN) · Ming Xie, Zizheng Huang, Xudong Tan, Chao Wang, Xiangyu Zeng, Wenxiao Wu, Tao Chen, Limin Wang, Yanwei Fu ·

    StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering

    arXiv:2605.25621v1. Abstract: While streaming omni-video understanding demands continuous perception and proactive, real-time interaction, this crucial area remains largely under-explored. Current omni-modal methods are inherently designed for offline settings, …

  4. arXiv cs.CV TIER_1 English(EN) · Yanwei Fu ·

    StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering

    While streaming omni-video understanding demands continuous perception and proactive, real-time interaction, this crucial area remains largely under-explored. Current omni-modal methods are inherently designed for offline settings, limiting their applicability in streaming scenar…

  5. arXiv cs.CV TIER_1 English(EN) · Haichen He, Jiayi Zhou, Sifeng Shang, Yihan Hu, Yuanhan Zhang, Kaiyang Zhou ·

    VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

    arXiv:2605.22907v1. Abstract: Real-world long video understanding requires models to perform continuous tracking, information integration and memory retention over massive temporal spans within extreme video durations. Mastering this intense cognitive load const…