A new research paper examines the challenges of automated short answer scoring (ASAS) with large language models (LLMs). The study found that LLMs such as GPT-5.2, GPT-4o, and Claude Opus 4.5 score fully correct and fully incorrect answers well, but their accuracy drops significantly on partially correct responses. The drop is most pronounced for few-shot LLMs and shrinks with more task-specific adaptation, with fine-tuned BERT models performing better on these mid-range answers. The research warns that this mid-range weakness could lead to inequitable evaluation of student responses.
AI IMPACT Highlights potential inequities in AI-driven educational assessments, particularly for answers that reflect nuanced or developing student understanding.
RANK_REASON This is a research paper detailing findings on LLM performance in a specific task.
AI-generated summary · Google Gemini · from 1 source.