Researchers have introduced HG-Bench, a new benchmark designed to evaluate the ability of AI models to accurately locate and ground answer regions within multi-page handwritten homework assignments. The benchmark consists of 500 annotated K-12 homework samples and includes a page-aware evaluation protocol that measures both complete answer localization and step-level decomposition. Current frontier closed-source APIs and open-weight VLMs perform poorly on HG-Bench, with no zero-shot system exceeding 55.22% on complete-answer localization. However, a GLM-4.6V 9B model fine-tuned on approximately 10,000 in-domain examples achieved significantly higher scores, highlighting a capability gap in handwritten reasoning grounding.
AI IMPACT Establishes a new benchmark for evaluating AI's ability to understand and ground handwritten reasoning in educational contexts.
RANK_REASON The cluster describes a new benchmark and evaluation protocol for AI models, published on arXiv.
AI-generated summary · Google Gemini · from 1 source.