Researchers have introduced DRAGON, a new benchmark designed to evaluate how well vision-language models (VLMs) can ground their reasoning in specific visual evidence within diagrams. The benchmark targets a known failure mode: models can arrive at correct answers through spurious correlations rather than genuine understanding of the visual content. DRAGON includes over 11,000 annotated question instances drawn from six existing diagram QA datasets, with a test set featuring human-verified reasoning-evidence annotations. Eight VLMs were evaluated on their ability to localize this evidence across diagram types, with the goal of improving interpretability and reliability in diagram-based reasoning.
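A common way to score evidence localization of this kind is intersection-over-union (IoU) between a model's predicted evidence region and the human-verified annotation. The summary does not specify DRAGON's exact metric, so the sketch below is only a minimal illustration under that assumption; the box coordinates and the 0.5 threshold are hypothetical.

```python
# Minimal sketch of evidence-localization scoring via IoU.
# Assumption: evidence annotations are axis-aligned boxes (x1, y1, x2, y2);
# DRAGON's actual metric is not specified in the summary above.

def iou(box_a: tuple[float, float, float, float],
        box_b: tuple[float, float, float, float]) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical example: a model's predicted evidence region vs. the annotation.
predicted = (120.0, 40.0, 260.0, 110.0)
annotated = (130.0, 50.0, 250.0, 120.0)
score = iou(predicted, annotated)
print(f"IoU = {score:.2f}")  # counted as correctly localized above a threshold, e.g. 0.5
```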
Summary written by gemini-2.5-flash-lite from 4 sources.
IMPACT: Improves evaluation of visual reasoning in diagrams, pushing for more interpretable and reliable AI systems.
RANK_REASON: This is a research paper introducing a new benchmark for evaluating AI models.