Researchers have introduced COHERENCE, a new benchmark designed to assess the fine-grained image-text alignment capabilities of Multimodal Large Language Models (MLLMs). Existing benchmarks often overlook the complexities of interleaved image-text contexts found in real-world documents. COHERENCE addresses this gap by evaluating MLLMs' ability to connect visual and textual information within such mixed-media environments, covering four distinct domains and featuring over 6,000 questions.
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT Provides a new evaluation framework for multimodal models, highlighting current limitations in understanding interleaved image-text data.
RANK_REASON This is a research paper introducing a new benchmark for evaluating AI models.