Brief · PulseAugur

Stage-adaptive Token Selection for Efficient Omni-modal LLMs

Researchers have developed SEATS, a method for making omni-modal large language models (om-LLMs) more efficient. SEATS prunes redundant audio-visual tokens throughout the model's layers, adapting token selection to the state of cross-modal fusion. The approach is reported to reduce computational load and speed up inference while maintaining performance.

IMPACT Reduces computational overhead and speeds up inference for multi-modal LLMs, potentially lowering deployment costs.
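
To make the idea of layer-wise token pruning concrete, here is a minimal sketch in Python. It is not the SEATS algorithm: the brief does not describe how SEATS scores tokens or how it adapts to fusion stages. Everything below is an assumption made for illustration, including the linear keep-ratio schedule, the `score_fn` callback, and all function and parameter names.

```python
from typing import Callable, List, Sequence


def keep_ratio_for_layer(layer: int, num_layers: int,
                         early_ratio: float = 0.9,
                         late_ratio: float = 0.3) -> float:
    """Hypothetical schedule: keep most tokens in early layers, fewer later.

    Linear interpolation between early_ratio and late_ratio. This is an
    assumed schedule, not one taken from the SEATS paper.
    """
    if num_layers <= 1:
        return late_ratio
    t = layer / (num_layers - 1)
    return (1.0 - t) * early_ratio + t * late_ratio


def select_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return positions of the highest-scoring tokens, in original order."""
    k = max(1, int(len(scores) * keep_ratio))  # always keep at least one token
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def prune_across_layers(token_ids: List[int], num_layers: int,
                        score_fn: Callable[[int, List[int]], List[float]],
                        early_ratio: float = 0.9,
                        late_ratio: float = 0.3) -> List[int]:
    """Prune the surviving token set once per layer.

    score_fn(layer, token_ids) is a stand-in for whatever importance signal
    a real model would provide (for example, attention received from text
    tokens). It must return one score per surviving token.
    """
    survivors = list(token_ids)
    for layer in range(num_layers):
        ratio = keep_ratio_for_layer(layer, num_layers, early_ratio, late_ratio)
        scores = score_fn(layer, survivors)
        kept_positions = select_tokens(scores, ratio)
        survivors = [survivors[p] for p in kept_positions]
    return survivors


if __name__ == "__main__":
    # Toy example: the score of a token is simply its id.
    ids = list(range(8))
    result = prune_across_layers(ids, num_layers=2,
                                 score_fn=lambda layer, toks: [float(t) for t in toks],
                                 early_ratio=0.5, late_ratio=0.5)
    print(result)
```

Because the surviving set shrinks at each layer, later layers process fewer audio-visual tokens, which is where the compute savings in this kind of scheme would come from.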

RESEARCH · Hugging Face Daily Papers · English (EN) · 1w · [4 sources]

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

Researchers have introduced OmniPro and VideoOdyssey, two benchmarks for evaluating how well omni-modal large language models understand long and complex video content. OmniPro focuses on proactive streaming video understanding: it assesses a model's ability to decide when to speak and what to say given audio-visual streams, and includes 2,700 human-verified samples across various tasks. VideoOdyssey targets ultra-long-context video understanding, using videos that average 109 minutes and evaluating continuous reasoning and memory retention over extended periods. Both benchmarks highlight current limitations in models' long-horizon robustness, audio utilization, and fine-grained perception, particularly with non-speech audio.

IMPACT These benchmarks could drive the development of AI models that understand complex, long-form video content, which matters for applications such as surveillance, content analysis, and autonomous systems.