Researchers have introduced COHERENCE, a new benchmark designed to assess the fine-grained image-text alignment capabilities of Multimodal Large Language Models (MLLMs). Existing benchmarks often overlook the complexities of interleaved image-text contexts found in real-world documents. COHERENCE addresses this gap by evaluating MLLMs' ability to connect visual and textual information within such mixed-media environments, covering four distinct domains and featuring over 6,000 questions.
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT Provides a new evaluation framework for multimodal models, highlighting current limitations in understanding interleaved image-text data.
RANK_REASON This is a research paper introducing a new benchmark for evaluating AI models.