Researchers have developed a new multimodal benchmark using data from Japan's National Assessment of Academic Ability, which includes approximately 900,000 aggregated student responses. The dataset features real exam materials from science, mathematics, and Japanese language subjects, preserving authentic layouts and diagrams. It aims to provide a human-grounded evaluation framework for multimodal large language models (MLLMs) by allowing direct comparison between model and human performance.
The work establishes a human-grounded benchmark for evaluating multimodal LLMs in educational contexts, particularly for Japanese-language assessments.