MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models

New MET-Bench benchmark reveals the limitations of vision-language models

By the PulseAugur editorial team · [1 source] · 2026-06-15 04:00

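Researchers have introduced MET-Bench, a new benchmark designed to evaluate how well vision-language models (VLMs) track entities across text and image modalities. The study finds a significant performance gap between text-only and multimodal entity tracking, attributed mainly to deficits in visual reasoning rather than in perception. Explicit text-based reasoning strategies improve performance, but long-horizon multimodal tasks remain challenging. Applying reinforcement learning to open-source VLMs yields gains within a modality that do not transfer effectively across modalities, which points to a need for stronger multimodal representations and reasoning techniques.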

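Impact: Highlights a key gap in the multimodal reasoning of current vision-language models and points to directions for future research and development.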

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CL · Vanya Cohen, Raymond Mooney · 2026-06-15 04:00

MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models

arXiv:2502.10886v3 Announce Type: replace Abstract: Entity state tracking is a necessary component of world modeling that requires maintaining coherent representations of entities over time. Previous work has benchmarked entity tracking performance in purely text-based tasks. We …
