Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

New Benchmark Shows Video Large Language Models Struggle with Brief Visual Events

By the PulseAugur editorial team · [1 source] · 2026-06-02 04:00

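Researchers have introduced Moment-Video, a new benchmark designed to evaluate the temporal fidelity of video multimodal large language models (MLLMs). The benchmark focuses on a model's ability to perceive and use brief, momentary visual events that are critical to answering a question. Current video MLLMs struggle with these transient events and often miss key details because of frame sampling or compression. On the new dataset, the best-performing model reaches an accuracy of only 39.6%.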
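To make the frame-sampling point concrete, here is a minimal sketch of how uniform frame sampling can skip a short event entirely. It is not the benchmark's method or any model's actual pipeline; the clip length, frame rate, event position, and the helper names `uniform_sample_indices` and `event_is_sampled` are all hypothetical choices made for illustration.

```python
def uniform_sample_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick frame indices at the centers of equal-width segments of the clip."""
    if num_samples <= 0 or total_frames <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def event_is_sampled(indices: list[int], event_start: int, event_end: int) -> bool:
    """Return True if any sampled index falls inside [event_start, event_end)."""
    return any(event_start <= i < event_end for i in indices)


# Hypothetical clip: 60 s at 30 fps, with a 0.5 s event starting at 20.0 s.
total_frames = 60 * 30               # 1800 frames
event_start, event_end = 600, 615    # frames covered by the momentary event

sparse = uniform_sample_indices(total_frames, 16)    # one frame every 112.5
dense = uniform_sample_indices(total_frames, 256)    # one frame every ~7

print("16 frames see the event: ", event_is_sampled(sparse, event_start, event_end))
print("256 frames see the event:", event_is_sampled(dense, event_start, event_end))
```

In this toy setup the 16-frame budget lands on frames 506 and 618 and steps over the event, while the 256-frame budget samples inside it. A denser budget catches the event here only because the sampling interval becomes shorter than the event itself.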

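Impact: Highlights a key gap in the capabilities of video MLLMs, indicating that current models need substantial improvement in temporal understanding before they can be applied in real-world settings.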

Ranking rationale: This cluster contains a research paper introducing a new benchmark for evaluating video multimodal large language models.


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

arXiv cs.AI · Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang, Yan Li, Xin Li, Haoyu Cao, Xing Sun, Shaofeng Zhang, Xu Yang, Zhihang Zhong, Xue Yang · 2026-06-02 04:00

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

arXiv:2606.02522v1 Announce Type: cross Abstract: Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains underexplored. Many practical questi…
