
New Benchmark and Training Framework Target Geometric and Fine-Grained Visual Perception in MLLMs

By the PulseAugur Editorial Team · [3 sources] · 2026-06-15 11:30

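Researchers have introduced a new benchmark and a new training framework to address limitations of multimodal large language models (MLLMs). GePBench focuses on evaluating and improving the fundamental geometric perception of MLLMs, and it reveals significant deficiencies in current state-of-the-art models. Separately, the LOCUS framework strengthens fine-grained visual perception by training MLLMs to make better use of local visual cues in an image, counteracting what its authors call "visual context rot".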

Impact: These advances aim to make multimodal AI systems more reliable and more capable at understanding complex visual information.

Ranking rationale: Two research papers introduce a new benchmark and a new training framework for multimodal large language models.


AI-generated summary · Google Gemini · based on 3 sources.

Sources [3]

arXiv cs.CL TIER_1 English(EN) · Shangyu Xing, Changhao Xiang, Yuteng Han, Yifan Yue, Zhen Wu, Xinyu Liu, Zhangtai Wu, Fei Zhao, Xinyu Dai · 2026-06-16 04:00

GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models

arXiv:2412.21036v3 · Announce Type: replace · Abstract: Geometric shapes play important roles in both physical world and human cognition. While multimodal large language models (MLLMs) have made significant advancements in visual understanding, their abilities to recognize geometric …

arXiv cs.CV TIER_1 English(EN) · Zhou Tao, Fang Zhang, Zewen Ding, Shida Wang, Xiaokun Sun, YongXiang Hua, Haoyu Cao, Linli Xu · 2026-06-16 04:00

LOCUS: Local Visual Cue Search for Enhancing Fine-Grained Perception in Multimodal Large Language Models

arXiv:2606.16586v1 · Announce Type: new · Abstract: Multimodal Large Language Models (MLLMs) remain unreliable on fine-grained visual perception, even when high-resolution inputs preserve the necessary local details. We identify this limitation as visual context rot: decisive evidenc…

arXiv cs.CV TIER_1 English(EN) · Linli Xu · 2026-06-15 11:30

LOCUS: Local Visual Cue Search for Enhancing Fine-Grained Perception in Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) remain unreliable on fine-grained visual perception, even when high-resolution inputs preserve the necessary local details. We identify this limitation as visual context rot: decisive evidence may exist in the full image, yet fail to be re…

