Researchers have introduced M$^3$-VQA, a new benchmark designed to evaluate multimodal large language models (MLLMs) on complex reasoning tasks involving multiple entities and multi-hop inference. The benchmark challenges models to understand fine-grained details across visual and textual sources, requiring both sequential and parallel reasoning. Initial evaluations of 16 leading MLLMs revealed significant limitations in their knowledge acquisition and reasoning capabilities, though performance improved substantially when models were provided with precise evidence.
Summary written by gemini-2.5-flash-lite from 6 sources.
Impact: This benchmark will drive advancements in multimodal reasoning for LLMs by highlighting current limitations.
Rank reason: Introduces a new benchmark for evaluating multimodal large language models.