Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
Researchers have introduced Moment-Video, a new benchmark designed to evaluate the temporal fidelity of video multimodal large language models (MLLMs). The benchmark tests whether models can understand brief, critical visual events that current frame sampling and compression techniques can miss. In evaluations of 33 MLLMs, even the top performer, Seed-2.0-Pro, reached only 39.6% accuracy, which points to a significant gap in how well current models capture and use transient visual information.
IMPACT: Highlights a critical limitation in video MLLMs, potentially driving research into more temporally aware architectures and evaluation methods.