A new paper examines how large language models (LLMs) perform on automated short answer scoring (ASAS), particularly with partially correct responses. The researchers found that while LLMs such as GPT-5.2, GPT-4o, and Claude Opus 4.5 excel at scoring fully correct or fully incorrect answers, their accuracy degrades significantly on mid-range, nuanced responses. The degradation tracks the amount of task-specific data available: few-shot LLMs given minimal examples perform worst, while fine-tuned models fare better, highlighting potential inequities in how students are evaluated.
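The few-shot condition the paper describes amounts to prompting a general-purpose model with a handful of scored examples. A minimal sketch of what such a setup could look like, assuming a 0-2 partial-credit scale and a generic `call_llm` chat-completion helper (the rubric, question, and examples below are illustrative placeholders, not taken from the paper):

```python
# Illustrative few-shot ASAS prompt construction. The question, rubric,
# and example answers are invented for this sketch; call_llm stands in
# for any chat-completion client.

FEW_SHOT_EXAMPLES = [
    # (student answer, score: 0 = incorrect, 1 = partial, 2 = correct)
    ("Photosynthesis makes food for the plant using sunlight.", 2),
    ("The plant breathes in sunlight.", 0),
    # Note: no score-1 example here; mid-range responses are exactly
    # where the paper reports few-shot scoring degrades.
]

QUESTION = "Explain what photosynthesis does for a plant."

def build_prompt(student_answer: str) -> str:
    """Assemble a few-shot scoring prompt with a 0-2 partial-credit scale."""
    lines = [
        "Score the student answer on a 0-2 scale:",
        "2 = fully correct, 1 = partially correct, 0 = incorrect.",
        f"Question: {QUESTION}",
    ]
    for answer, score in FEW_SHOT_EXAMPLES:
        lines.append(f"Answer: {answer}\nScore: {score}")
    lines.append(f"Answer: {student_answer}\nScore:")
    return "\n".join(lines)

def score_answer(student_answer: str, call_llm) -> int:
    """call_llm is a placeholder: any callable that maps prompt -> reply text."""
    reply = call_llm(build_prompt(student_answer))
    return int(reply.strip().split()[0])  # expect a bare digit first
```

Fine-tuning, by contrast, bakes many such graded examples into the model's weights, which is consistent with the paper's finding that more task-specific data narrows the gap on partially correct answers.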
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Highlights potential biases in LLM-based educational tools, urging attention to fairness when these systems evaluate students whose understanding is still developing.
RANK_REASON Academic paper detailing model performance on a specific task.