New benchmarks target MLLM geometric and fine-grained visual perception

By PulseAugur Editorial · [3 sources] · 2026-06-15 11:30

Two new research efforts address limitations in multimodal large language models (MLLMs). GePBench focuses on evaluating and improving MLLMs' fundamental geometric perception, and it reveals significant deficiencies in current state-of-the-art models. Separately, the LOCUS framework targets fine-grained visual perception by training MLLMs to make better use of local visual cues within an image, countering what its authors call "visual context rot."

IMPACT These advancements aim to improve the reliability and capabilities of multimodal AI systems in understanding complex visual information.

RANK_REASON Two research papers introducing new benchmarks and training frameworks for multimodal large language models.

AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

arXiv cs.CL TIER_1 English(EN) · Shangyu Xing, Changhao Xiang, Yuteng Han, Yifan Yue, Zhen Wu, Xinyu Liu, Zhangtai Wu, Fei Zhao, Xinyu Dai · 2026-06-16 04:00

GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models

arXiv:2412.21036v3 · Announce Type: replace · Abstract: Geometric shapes play important roles in both physical world and human cognition. While multimodal large language models (MLLMs) have made significant advancements in visual understanding, their abilities to recognize geometric …

arXiv cs.CV TIER_1 English(EN) · Zhou Tao, Fang Zhang, Zewen Ding, Shida Wang, Xiaokun Sun, YongXiang Hua, Haoyu Cao, Linli Xu · 2026-06-16 04:00

LOCUS: Local Visual Cue Search for Enhancing Fine-Grained Perception in Multimodal Large Language Models

arXiv:2606.16586v1 · Announce Type: new · Abstract: Multimodal Large Language Models (MLLMs) remain unreliable on fine-grained visual perception, even when high-resolution inputs preserve the necessary local details. We identify this limitation as visual context rot: decisive evidenc…

arXiv cs.CV TIER_1 English(EN) · Linli Xu · 2026-06-15 11:30

LOCUS: Local Visual Cue Search for Enhancing Fine-Grained Perception in Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) remain unreliable on fine-grained visual perception, even when high-resolution inputs preserve the necessary local details. We identify this limitation as visual context rot: decisive evidence may exist in the full image, yet fail to be re…
