Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

MLLMs show a reasoning gap on text rendered as images; self-distillation closes it

By the PulseAugur editorial team · [1 source] · 2026-05-26 04:00

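Researchers found that multimodal large language models (MLLMs) show a significant performance gap when processing text presented as images compared with the same text supplied as standard text tokens. This "modality gap" is driven mainly by the model reasoning less on visual input, which yields shorter outputs and less computation. A new self-distillation fine-tuning method, which pairs image inputs with the model's own reasoning traces from text mode, effectively closes the gap, improves accuracy, and transfers its gains to new benchmarks.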
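The pairing step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate_from_text` and `render_as_image` are hypothetical callables standing in for a text-mode model call and a text-to-image renderer, and `modality_gap` simply assumes the gap is measured as text-mode accuracy minus image-mode accuracy on the same items.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class DistillExample:
    image: bytes   # the prompt rendered as an image (input for fine-tuning)
    target: str    # the model's own text-mode reasoning trace (training target)


def build_self_distillation_set(
    prompts: Iterable[str],
    generate_from_text: Callable[[str], str],  # hypothetical: model run on text tokens
    render_as_image: Callable[[str], bytes],   # hypothetical: text -> image bytes
) -> List[DistillExample]:
    """Pair each image-rendered prompt with the trace produced in text mode."""
    examples = []
    for prompt in prompts:
        trace = generate_from_text(prompt)   # reasoning produced from text tokens
        image = render_as_image(prompt)      # same content, presented as pixels
        examples.append(DistillExample(image=image, target=trace))
    return examples


def modality_gap(text_correct: List[bool], image_correct: List[bool]) -> float:
    """Assumed definition: text-mode accuracy minus image-mode accuracy."""
    if len(text_correct) != len(image_correct) or not text_correct:
        raise ValueError("need two non-empty result lists of equal length")
    n = len(text_correct)
    return sum(text_correct) / n - sum(image_correct) / n
```

The resulting list would then feed an ordinary supervised fine-tuning loop on image inputs; that loop, and any filtering of traces, is outside what the summary specifies and is left out here.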

Impact: Identifies a key limitation of MLLMs and proposes a method to improve their reasoning on text presented as images.

Ranking rationale: Academic paper detailing a new finding and method for multimodal LLMs.


Multimodal large language models

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CL TIER_1 English(EN) · Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang, Mark Dredze, Fan Bai · 2026-05-26 04:00

Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

arXiv:2603.09095v2. Abstract: Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating s…
