SSMNBench: Diagnosing Image-based Cross-View Human-Object Understanding via Single-View Sufficiency and Multi-View Necessity

New benchmark reveals that multimodal large language models (MLLMs) struggle with cross-view understanding

By the PulseAugur editorial team · [1 source] · 2026-06-24 09:38

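Researchers have developed SSMNBench, a new diagnostic benchmark designed to evaluate the cross-view human-object understanding of multimodal large language models (MLLMs). The benchmark contains 3,300 question-answer pairs, divided into Single-View Sufficiency (SVS) and Multi-View Necessity (MVN) tasks. Evaluations with SSMNBench show that current MLLMs struggle to integrate fragmented evidence from multiple views and are prone to "distraction degradation" when given redundant visual information, which suggests that they rely on semantic averaging rather than genuine cross-view synthesis.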
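As a rough illustration of how a "distraction degradation" effect could be measured, the sketch below compares accuracy on each task type with and without redundant views. This is not the paper's evaluation code or its metric, which this summary does not specify. The `Record` fields, the `summarize` helper, and the toy data are all hypothetical; only the SVS and MVN task names come from the summary above.

```python
from dataclasses import dataclass


@dataclass
class Record:
    task: str                # "SVS" or "MVN" (task names from the summary)
    correct_minimal: bool    # hypothetical: correct when given only the needed view(s)
    correct_redundant: bool  # hypothetical: correct after redundant views are added


def accuracy(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def summarize(records):
    """Per-task accuracy under both conditions, plus the drop between them."""
    report = {}
    for task in sorted({r.task for r in records}):
        rows = [r for r in records if r.task == task]
        acc_min = accuracy(r.correct_minimal for r in rows)
        acc_red = accuracy(r.correct_redundant for r in rows)
        report[task] = {
            "minimal": acc_min,
            "redundant": acc_red,
            "drop": acc_min - acc_red,  # assumed proxy for distraction degradation
        }
    return report


# Toy, made-up records for illustration only (not benchmark data).
toy = [
    Record("SVS", True, True),
    Record("SVS", True, False),
    Record("MVN", True, False),
    Record("MVN", False, False),
]
report = summarize(toy)
for task, stats in report.items():
    print(task, stats)
```

A positive `drop` on a task would indicate that the extra views hurt the model on that task, which is the pattern the summary describes.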

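Impact: Highlights fundamental limitations of current MLLMs and points future research toward more robust cross-view reasoning architectures.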

Ranking rationale: This cluster contains a research paper introducing a new benchmark for evaluating AI models.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CV · Xin Yu · 2026-06-24 09:38

SSMNBench: Diagnosing Image-based Cross-View Human-Object Understanding via Single-View Sufficiency and Multi-View Necessity

Multimodal Large Language Models (MLLMs) have shown remarkable progress in single-image perception, yet their ability to reason about complex cross-view human-centric scenes remains largely unverified. Current multi-view benchmarks evaluate models using a fixed "bag of frames" an…
