Researchers have introduced CiteVQA, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to accurately attribute answers to specific source regions within documents. Unlike previous evaluations that scored only the final answer, CiteVQA requires models to provide element-level bounding-box citations alongside their answers and assesses both jointly. The benchmark, comprising 1,897 questions across 711 PDFs, reveals a significant issue termed "Attribution Hallucination," in which models often give correct answers but cite incorrect evidence, highlighting a critical reliability gap in current document intelligence systems.
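To make the idea of joint answer-and-citation scoring concrete, here is a minimal Python sketch. It is not CiteVQA's actual metric, which this summary does not describe: the exact-match answer check, the 0.5 intersection-over-union (IoU) threshold, and the names `Prediction`, `iou`, and `judge` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical box format: (x0, y0, x1, y1) in page coordinates.
Box = tuple[float, float, float, float]


@dataclass
class Prediction:
    answer: str
    cited_box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def judge(pred: Prediction, gold_answer: str, gold_box: Box,
          iou_threshold: float = 0.5) -> str:
    """Label one prediction by checking answer and citation together.

    Assumed rules (not taken from the benchmark): the answer is matched
    case-insensitively, and a citation counts as correct when its IoU
    with the gold region reaches iou_threshold.
    """
    answer_ok = pred.answer.strip().lower() == gold_answer.strip().lower()
    citation_ok = iou(pred.cited_box, gold_box) >= iou_threshold
    if answer_ok and citation_ok:
        return "correct"
    if answer_ok:
        # Right answer, wrong evidence: the failure mode the summary
        # calls "Attribution Hallucination".
        return "attribution_hallucination"
    return "wrong_answer"
```

Under these assumptions, an answer-only evaluation would count the middle case as a success, while a joint check flags it as a failure.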
Impact: The benchmark exposes a critical weakness in how current multimodal LLMs cite their sources, which could undermine trust and reliability in high-stakes applications.
Ranking rationale: The cluster describes a new academic benchmark for evaluating AI models.
AI-generated summary · Google Gemini · from 1 source.