Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
Researchers have introduced Moment-Video, a benchmark for evaluating the temporal fidelity of video multimodal large language models (MLLMs). It tests whether a model can perceive and use brief, momentary visual events that are critical to answering a question. Current video MLLMs struggle with these transient events and often miss crucial details because of frame sampling or compression: the best-performing model reaches only 39.6% accuracy on the new dataset.
AI IMPACT: Highlights a critical gap in video MLLM capabilities, suggesting that current models need significant improvements in temporal understanding before they are reliable in real-world applications.
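The frame-sampling failure mode mentioned in the summary can be illustrated with a small sketch. This is not the benchmark's methodology or any model's actual pipeline; the function names, the 60-second clip, the half-second event, and the frame budgets are all hypothetical values chosen for illustration. It only shows why a uniformly sampled, sparse set of frames can skip a momentary event entirely.

```python
def uniform_sample_times(duration_s, num_frames):
    """Timestamps (seconds) of frames taken at the centers of equal-length segments.

    Hypothetical helper: real video MLLMs may use other sampling strategies.
    """
    step = duration_s / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]


def event_is_sampled(sample_times, event_start_s, event_end_s):
    """True if at least one sampled timestamp falls inside the event window."""
    return any(event_start_s <= t <= event_end_s for t in sample_times)


# Assumed values for illustration only.
duration_s = 60.0
event = (12.3, 12.8)  # a half-second momentary event

sparse = uniform_sample_times(duration_s, 8)    # one frame every 7.5 s
dense = uniform_sample_times(duration_s, 240)   # one frame every 0.25 s

print("sparse sampling sees event:", event_is_sampled(sparse, *event))
print("dense sampling sees event:", event_is_sampled(dense, *event))
```

With these assumed numbers, the 8-frame budget places its nearest frames at 11.25 s and 18.75 s, so the event never reaches the model, while the denser budget captures it. A question whose answer depends on that half second would be unanswerable from the sparse frames, regardless of how capable the language model is.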