PulseAugur

COHERENCE benchmark evaluates MLLMs' fine-grained image-text alignment in interleaved contexts

Researchers have introduced COHERENCE, a new benchmark designed to assess the fine-grained image-text alignment capabilities of Multimodal Large Language Models (MLLMs). Existing benchmarks often overlook the complexities of interleaved image-text contexts found in real-world documents. COHERENCE addresses this gap by evaluating MLLMs' ability to connect visual and textual information within such mixed-media environments, covering four distinct domains and featuring over 6,000 questions.

Summary written by gemini-2.5-flash-lite from 2 sources.

IMPACT Provides a new evaluation framework for multimodal models, highlighting their current limitations in understanding interleaved image-text data.

RANK_REASON This is a research paper introducing a new benchmark for evaluating AI models.

Read on arXiv cs.CV →

COVERAGE [2]

  1. arXiv cs.AI TIER_1 · Bingli Wang, Huanze Tang, Haijun Lv, Zhishan Lin, Lixin Gu, Lei Feng, Qipeng Guo, Kai Chen

    COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

    arXiv:2604.27389v1 · Announce Type: cross · Abstract: In recent years, Multimodal Large Language Models (MLLMs) have achieved remarkable progress on a wide range of multimodal benchmarks. Despite these advances, most existing benchmarks mainly focus on single-image or multi-image com…

  2. arXiv cs.CV TIER_1 · Kai Chen

    COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

    In recent years, Multimodal Large Language Models (MLLMs) have achieved remarkable progress on a wide range of multimodal benchmarks. Despite these advances, most existing benchmarks mainly focus on single-image or multi-image comprehension. In real-world scenarios such as docume…