Faithful Grounded Visual Reasoning via Learned Proxy-Tokens

New AI Model Composer Uses Proxy Tokens to Improve Visual Reasoning and Interpretability

By the PulseAugur editorial team · [1 source] · 2026-06-22 13:54

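Researchers have developed a new multimodal large language model (MLLM) called Composer, designed to make AI systems more interpretable and trustworthy. Composer uses learned proxy tokens to explicitly tie textual explanations to their visual grounding, addressing the semantic-space gap found in current models. The mechanism lets the model treat visual regions as an addressable set, which improves visual grounding accuracy by 9.0 percentage points while preserving final-answer accuracy. A new dataset, ComposerGCoT, was created to rigorously evaluate this grounding mechanism, and it shows that discrete proxy tokens capture spatial semantics more effectively than conventional textual coordinates.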
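To make the contrast between textual coordinates and discrete proxy tokens concrete, here is a minimal sketch. It is not Composer's implementation: the `Region` class, the `<region_i>` token format, and the helper functions are all hypothetical names chosen for illustration, and the sketch only mimics the idea of treating visual regions as an addressable set that an explanation can point into.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Region:
    """A candidate visual region. Hypothetical structure, not from the paper."""
    label: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def proxy_token(index: int) -> str:
    """Name the discrete token that addresses region `index` (assumed format)."""
    return f"<region_{index}>"


def build_region_table(regions: List[Region]) -> Dict[str, Region]:
    """Map each proxy token to its region, so regions form an addressable set."""
    return {proxy_token(i): region for i, region in enumerate(regions)}


def as_text_coordinates(region: Region) -> str:
    """Baseline style: spell the box out as plain text the model must generate."""
    x1, y1, x2, y2 = region.box
    return f"[{x1}, {y1}, {x2}, {y2}]"


def resolve_grounding(explanation: str, table: Dict[str, Region]) -> List[Region]:
    """Return the regions an explanation points to, in order of mention.

    Tokens that do not address a known region are skipped.
    """
    mentioned = re.findall(r"<region_\d+>", explanation)
    return [table[token] for token in mentioned if token in table]


regions = [
    Region("dog", (10, 20, 110, 220)),
    Region("ball", (150, 180, 190, 220)),
]
table = build_region_table(regions)

# Proxy-token style: the explanation refers to regions by address.
explanation = "The <region_0> is looking at the <region_1>."
grounded = resolve_grounding(explanation, table)

# Coordinate style: the same reference written as digits in text.
coordinate_form = as_text_coordinates(regions[0])
```

In this toy version, a proxy token either addresses a known region or it does not, so a grounding reference can be checked by lookup. A coordinate string, by contrast, has to be parsed and compared numerically against a box.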

Impact: Introduces a novel approach to MLLM interpretability that could improve trust and reliability in critical AI applications.

Ranking rationale: The cluster contains an academic paper detailing a new model and dataset.


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

arXiv cs.CV · Angelique Loesch · 2026-06-22 13:54

Faithful Grounded Visual Reasoning via Learned Proxy-Tokens

Multimodal Large Language Models (MLLMs) have achieved remarkable success in Visual Question Answering (VQA), yet their "black-box" nature hinders deployment in critical domains. Grounded Visual Reasoning (GVR) approaches attempt to improve interpretability by explicitly couple t…

