Omnimodal LLMs fail to act on the sensory contradictions they detect

By PulseAugur Editorial · [1 source] · 2026-05-13 16:14

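Researchers have identified a "representation-action gap" in omnimodal large language models: a model can internally recognize that a textual claim contradicts its sensory input, yet fail to reflect that contradiction in its output. To test this ability, they built a new benchmark, IMAVB, from movie clips. The results show that current models either accept false premises or reject too many standard claims. The study suggests that the grounding bottleneck in these models lies in turning perception into action, not in perception itself.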
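The two failure modes named above (accepting a false premise, and rejecting a claim that is actually fine) can be tallied separately. The following is a minimal sketch only: the source does not describe IMAVB's data format or scoring, so the `Item` record, its field names, and the `failure_rates` helper are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One hypothetical evaluation record (not the actual IMAVB schema)."""
    premise_is_false: bool  # True if the textual premise contradicts the clip
    model_rejected: bool    # True if the model's answer flags the premise as wrong


def failure_rates(items):
    """Return (false-premise acceptance rate, over-rejection rate)."""
    false_items = [i for i in items if i.premise_is_false]
    valid_items = [i for i in items if not i.premise_is_false]

    # Failure mode 1: the premise is false, but the model goes along with it.
    accepted_false = sum(not i.model_rejected for i in false_items)
    # Failure mode 2: the premise is fine, but the model rejects it anyway.
    rejected_valid = sum(i.model_rejected for i in valid_items)

    accept_rate = accepted_false / len(false_items) if false_items else 0.0
    over_reject_rate = rejected_valid / len(valid_items) if valid_items else 0.0
    return accept_rate, over_reject_rate
```

Reporting the two rates side by side matters because a model can drive one to zero trivially (always reject, or never reject) at the cost of the other.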

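Impact: Highlights a key gap in the grounding of omnimodal LLMs, indicating that current models struggle to turn what they perceive into reliable action.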

Ranking rationale: This cluster contains an academic paper that details a new benchmark and findings on LLM capabilities.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI TIER_1 English(EN) · Ziwei Liu · 2026-05-13 16:14

Senses Wide Shut: The Representation-Action Gap in Omnimodal Large Models

When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and…
