New DiCoBench benchmark reveals MLLM struggles with high-resolution visual perception

By PulseAugur Editorial · [2 sources] · 2026-06-25 05:02

Researchers have introduced DiCoBench, a benchmark for evaluating the fine-grained perception of Multimodal Large Language Models (MLLMs) on high-resolution, multi-image inputs. It comprises 765 samples across two tracks and eight perception tasks, focusing on differential and commonality visual cues. Evaluations of 18 MLLMs showed a significant gap relative to human accuracy, highlighting the difficulty of capturing micro-scale details.

IMPACT Highlights the limitations of current MLLMs on high-resolution visual tasks and may guide future research on perception capabilities.

RANK_REASON The cluster describes a new academic benchmark paper for evaluating AI models.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.CV TIER_1 English(EN) · Geng Li, Yuxin Peng · 2026-06-26 04:00

DiCoBench: Benchmarking Multi-Image Fine-Grained Perception via Differential and Commonality Visual Cues

arXiv:2606.26602v1 Announce Type: new Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive fine-grained perception capabilities. However, existing benchmarks predominantly rely on explicit textual cues or low-resolution inputs, fa…

arXiv cs.CV TIER_1 English(EN) · Yuxin Peng · 2026-06-25 05:02

DiCoBench: Benchmarking Multi-Image Fine-Grained Perception via Differential and Commonality Visual Cues

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive fine-grained perception capabilities. However, existing benchmarks predominantly rely on explicit textual cues or low-resolution inputs, failing to evaluate a model's ability to autonomou…
