Researchers have developed a new multimodal benchmark using data from Japan's National Assessment of Academic Ability, which includes approximately 900,000 aggregated student responses. The dataset features real exam materials from science, mathematics, and Japanese language subjects, preserving authentic layouts and diagrams. It aims to provide a human-grounded evaluation framework for multimodal large language models (MLLMs) by allowing direct comparison between model and human performance.
The work establishes a human-grounded benchmark for evaluating multimodal LLMs in educational contexts, particularly for Japanese-language assessments.