LLMs struggle with partially correct answers in automated scoring

By PulseAugur Editorial · [1 source] · 2026-05-26 04:00

A new research paper examines automated short answer scoring (ASAS) with large language models (LLMs). The study found that LLMs such as GPT-5.2, GPT-4o, and Claude Opus 4.5 score fully correct and fully incorrect answers well, but their performance degrades significantly on partially correct responses. The degradation is most pronounced for few-shot LLMs and shrinks with more task-specific adaptation; fine-tuned BERT models handle these mid-range answers better. The research warns that this mid-range weakness could lead to inequitable evaluation of student responses.
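To make the idea of score-dependent performance concrete, here is a minimal sketch of agreement computed separately for each human score level. It is only an illustration: the function name `agreement_by_score_level`, the 0-2 scale, the use of exact-match agreement, and the toy scores are all assumptions made here, not the paper's metric or data.

```python
from collections import defaultdict


def agreement_by_score_level(human_scores, model_scores):
    """Exact-match agreement rate between model and human, grouped by human score.

    Hypothetical helper for illustration; the paper may use a different metric.
    """
    if len(human_scores) != len(model_scores):
        raise ValueError("score lists must have the same length")
    matches = defaultdict(int)
    totals = defaultdict(int)
    for human, model in zip(human_scores, model_scores):
        totals[human] += 1
        if human == model:
            matches[human] += 1
    # One agreement rate per human score level, in ascending order of level.
    return {level: matches[level] / totals[level] for level in sorted(totals)}


# Made-up toy scores (not from the paper) on an assumed 0-2 scale:
# 0 = incorrect, 1 = partially correct, 2 = fully correct.
human = [0, 0, 1, 1, 1, 1, 2, 2]
model = [0, 0, 0, 1, 2, 2, 2, 2]

rates = agreement_by_score_level(human, model)
print(rates)  # {0: 1.0, 1: 0.25, 2: 1.0}
```

In this toy data the overall exact-match rate is 5 of 8, which looks moderate, while the per-level view shows that every disagreement falls on the partially correct answers. That is the kind of pattern a single aggregate agreement number can hide.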

IMPACT Highlights potential inequities in AI-driven educational assessments, particularly for nuanced or developing student understanding.

RANK_REASON This is a research paper detailing findings on LLM performance in a specific task.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Abigail Victoria Gurin Schleifer, Moriah Ariely, Beata Beigman Klebanov, Asaf Salman, Giora Alexandron · 2026-05-26 04:00

Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation

arXiv:2605.07647v2 Announce Type: replace-cross Abstract: Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs' broad world knowledge and ease of deployment,…
