Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 6h

Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering

A new paper examines generative AI models as graders for K-12 assessments, focusing on context engineering and prompt design. The researchers evaluated Claude Sonnet 4, Claude Haiku 4.5, GPT-5, and GPT-5 Mini on Massachusetts Comprehensive Assessment System (MCAS) data across math, science, and English language arts (ELA). LLM graders, particularly the larger models, showed substantial agreement with human raters in math and science, while performance in ELA varied. AI-generated narrative feedback was well received, but numerical scores drew skepticism, suggesting that LLMs are more effective as formative tools.
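For readers unfamiliar with how grader-to-human agreement is typically quantified, here is a minimal sketch of quadratic weighted kappa, a statistic often used for ordinal rubric scores. The brief does not say which agreement statistic the paper reports, so treating it as weighted kappa is an assumption; the function name `quadratic_weighted_kappa` and the sample scores are hypothetical and purely illustrative.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Agreement between two raters on an ordinal scale (illustrative sketch).

    Returns 1.0 for perfect agreement, about 0.0 for chance-level agreement,
    and negative values for systematic disagreement.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")

    n = len(human)
    k = max_score - min_score + 1  # number of score categories
    if k == 1:
        return 1.0  # only one possible score, so raters cannot disagree

    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    human_hist = Counter(human)            # marginal counts for the human rater
    model_hist = Counter(model)            # marginal counts for the model grader

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: larger score gaps count as worse disagreement.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed.get((i, j), 0) / n
            weighted_expected += weight * human_hist.get(i, 0) * model_hist.get(j, 0) / (n * n)

    if weighted_expected == 0:
        return 1.0  # both raters gave the same single score to every response
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical rubric scores on a 0-3 scale (not data from the paper).
human_scores = [0, 1, 2, 3, 2, 1]
model_scores = [0, 1, 2, 3, 3, 1]
kappa = quadratic_weighted_kappa(human_scores, model_scores, 0, 3)
```

Under the commonly cited Landis and Koch convention, kappa values between 0.61 and 0.80 are labeled "substantial"; whether the paper uses that convention is not stated in this brief.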

IMPACT Suggests LLMs can effectively assist educators with grading, potentially reducing workload and enhancing feedback quality, particularly in STEM subjects.

GPT-5
LLM
Claude Haiku 4.5
Claude Sonnet 4
GPT-5 Mini
Massachusetts Comprehensive Assessment System