A new paper examines how large language models (LLMs) perform on automated short answer scoring (ASAS), particularly with partially correct responses. The researchers found that while LLMs such as GPT-5.2, GPT-4o, and Claude Opus 4.5 excel at scoring fully correct or fully incorrect answers, their accuracy degrades significantly on mid-range, nuanced responses. The degradation tracks the amount of task-specific data available: few-shot LLMs given minimal examples perform worst, while fine-tuned models fare better, highlighting potential inequities in how students are evaluated.
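The few-shot condition the paper describes amounts to prompting a general-purpose model with a handful of scored examples. A minimal sketch of what such a setup could look like, assuming a 0-2 partial-credit scale and a generic `call_llm` chat-completion helper (the rubric, question, and examples below are illustrative placeholders, not taken from the paper):

```python
# Illustrative few-shot ASAS prompt construction. The question, rubric,
# and example answers are invented for this sketch; call_llm stands in
# for any chat-completion client.

FEW_SHOT_EXAMPLES = [
    # (student answer, score: 0 = incorrect, 1 = partial, 2 = correct)
    ("Photosynthesis makes food for the plant using sunlight.", 2),
    ("The plant breathes in sunlight.", 0),
    # Note: no score-1 example here; mid-range responses are exactly
    # where the paper reports few-shot scoring degrades.
]

QUESTION = "Explain what photosynthesis does for a plant."

def build_prompt(student_answer: str) -> str:
    """Assemble a few-shot scoring prompt with a 0-2 partial-credit scale."""
    lines = [
        "Score the student answer on a 0-2 scale:",
        "2 = fully correct, 1 = partially correct, 0 = incorrect.",
        f"Question: {QUESTION}",
    ]
    for answer, score in FEW_SHOT_EXAMPLES:
        lines.append(f"Answer: {answer}\nScore: {score}")
    lines.append(f"Answer: {student_answer}\nScore:")
    return "\n".join(lines)

def score_answer(student_answer: str, call_llm) -> int:
    """call_llm is a placeholder: any callable that maps prompt -> reply text."""
    reply = call_llm(build_prompt(student_answer))
    return int(reply.strip().split()[0])  # expect a bare digit first
```

Fine-tuning, by contrast, bakes many such graded examples into the model's weights, which is consistent with the paper's finding that more task-specific data narrows the gap on partially correct answers.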
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Highlights potential biases in LLM-based educational tools, urging attention to fairness when these systems evaluate students whose understanding is still developing.
RANK_REASON Academic paper detailing model performance on a specific task.