New HG-Bench benchmark reveals AI struggles with handwritten homework assessment

By PulseAugur Editorial · [1 source] · 2026-06-24 07:18

Researchers have introduced HG-Bench, a benchmark that tests whether AI models can locate and ground answer regions in multi-page handwritten homework. It comprises 500 annotated K-12 homework samples and a page-aware evaluation protocol that measures both complete-answer localization and step-level decomposition. Frontier closed-source APIs and open-weight vision-language models (VLMs) perform poorly: no zero-shot system exceeds 55.22% on complete-answer localization. A GLM-4.6V 9B model fine-tuned on approximately 10,000 in-domain examples scored significantly higher, which points to a capability gap in grounding handwritten reasoning that in-domain training can narrow.
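
The summary does not spell out how the page-aware protocol scores a prediction. As a rough illustration only, the sketch below assumes a common convention: a predicted region counts as correct when it lies on the same page as the annotated region and their intersection-over-union (IoU) meets a threshold. The `Region` type, the 0.5 threshold, and the function names are hypothetical and are not taken from the HG-Bench paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Region:
    """Hypothetical answer region: a page index plus an axis-aligned box."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def page_aware_iou(a: Region, b: Region) -> float:
    """IoU of two regions; regions on different pages never overlap."""
    if a.page != b.page:
        return 0.0
    iw = min(a.x1, b.x1) - max(a.x0, b.x0)
    ih = min(a.y1, b.y1) - max(a.y0, b.y0)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def localization_accuracy(
    preds: Sequence[Optional[Region]],
    golds: Sequence[Region],
    threshold: float = 0.5,  # assumed cutoff, not from the paper
) -> float:
    """Fraction of gold regions matched by the aligned prediction."""
    if not golds:
        return 0.0
    hits = sum(
        1
        for p, g in zip(preds, golds)
        if p is not None and page_aware_iou(p, g) >= threshold
    )
    return hits / len(golds)


golds = [Region(1, 0, 0, 10, 10), Region(2, 0, 0, 10, 10)]
preds = [Region(1, 0, 0, 10, 8), Region(1, 0, 0, 10, 10)]  # second is on the wrong page
accuracy = localization_accuracy(preds, golds)
```

Under these assumptions, a box with perfect coordinates on the wrong page scores zero, which is the kind of error a single-page metric would miss.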

IMPACT Establishes a new benchmark for evaluating AI's ability to understand and ground handwritten reasoning in educational contexts.

RANK_REASON The cluster describes a new benchmark and evaluation protocol for AI models, published on arXiv.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Juanzi Li · 2026-06-24 07:18

HG-Bench: A Benchmark for Multi-Page Handwritten Answer-Region Grounding in Automated Homework Assessment

Automated homework assessment depends not only on recognizing student answers, but also on accurately locating where each answer and each intermediate reasoning step appears in noisy, multi-page handwritten work. This paper addresses the missing evaluation setting of page-aware, …
