Researchers have introduced CiteVQA, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to attribute answers to specific source regions within documents. Unlike previous evaluations that scored only the final answer, CiteVQA requires models to provide element-level bounding-box citations alongside their answers, assessing both jointly. The benchmark, comprising 1,897 questions across 711 PDFs, reveals a significant failure mode termed "Attribution Hallucination," in which models give correct answers but cite incorrect evidence, exposing a critical reliability gap in current document intelligence systems.
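To make the joint-assessment idea concrete, here is a minimal sketch of how an answer and its bounding-box citation might be scored together. This is an illustrative assumption, not CiteVQA's actual protocol: the function names (`iou`, `judge`), the exact-match answer check, and the 0.5 IoU threshold are all hypothetical.

```python
# Hypothetical sketch of joint answer+citation scoring (not CiteVQA's actual protocol).

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def judge(pred_answer, gold_answer, pred_box, gold_box, iou_thresh=0.5):
    """Classify a prediction by checking the answer and the citation jointly."""
    answer_ok = pred_answer.strip().lower() == gold_answer.strip().lower()
    citation_ok = iou(pred_box, gold_box) >= iou_thresh
    if answer_ok and not citation_ok:
        # The case the benchmark highlights: right answer, wrong evidence.
        return "attribution_hallucination"
    return "correct" if (answer_ok and citation_ok) else "wrong"
```

Under a scheme like this, a model that answers correctly but points at an unrelated page region is flagged rather than credited, which is exactly the gap that answer-only scoring misses.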
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT This benchmark exposes a critical flaw in current MLLMs' ability to cite their sources, with implications for trust and reliability in high-stakes applications.
RANK_REASON The cluster describes a new academic benchmark for evaluating AI models.