PulseAugur

New multimodal benchmark uses 900K Japanese student responses

Researchers have built a multimodal benchmark from Japan's National Assessment of Academic Ability, drawing on approximately 900,000 aggregated student responses. The dataset uses real exam materials in science, mathematics, and Japanese language, preserving their original layouts and diagrams. It is intended as a human-grounded evaluation framework for multimodal large language models (MLLMs), allowing model performance to be compared directly with student performance.

IMPACT Establishes a human-grounded benchmark for evaluating multimodal LLMs in educational contexts, specifically Japanese K-12 assessments.

RANK_REASON Academic paper introducing a new dataset and benchmark for evaluating multimodal LLMs.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

  1. arXiv cs.CL TIER_1 English(EN) · Yusuke Miyao

    Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

    Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a multimodal dataset constructed from Japan's National Assessment of Academic Abi…