LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment
A new study published on arXiv investigates whether large language models (LLMs) can estimate item discrimination in educational assessments. The researchers evaluated 42 LLMs using two methods: direct prediction of discrimination values, and response-based calibration that treats LLM answers as synthetic student responses. The findings indicate that while LLM outputs carry some non-random signal related to item discrimination, the models do not yet reliably capture how assessment items distinguish between students of different proficiency levels: the best-performing models reached a Spearman correlation of only 0.241.
AI IMPACT: LLMs cannot yet reliably estimate how well assessment items separate students of different proficiency levels, which points to a gap in their use for educational evaluation.
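To make the response-based calibration idea concrete, here is a minimal sketch in Python. It assumes a classical item-rest correlation as the discrimination index and a hand-rolled Spearman correlation for the comparison step. The summary does not say which discrimination statistic the study uses, so the index, the function names, and the toy data below are all illustrative assumptions rather than the authors' method.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined here; treated as "no signal"
    return cov / (vx * vy) ** 0.5


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def item_rest_discrimination(responses):
    """Item-rest correlation per item (an assumed stand-in for discrimination).

    responses[s][i] is 1 if respondent s answered item i correctly, else 0.
    In the response-based setting, each "respondent" would be one LLM.
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    result = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [t - v for t, v in zip(totals, item)]  # score excluding item i
        result.append(pearson(item, rest))
    return result


if __name__ == "__main__":
    # Hypothetical data: 5 synthetic respondents x 4 items, invented for illustration.
    synthetic_responses = [
        [1, 1, 1, 0],
        [1, 1, 0, 1],
        [1, 0, 0, 0],
        [0, 1, 0, 1],
        [0, 0, 0, 1],
    ]
    # Hypothetical reference discrimination values from real student data.
    reference = [0.45, 0.30, 0.60, 0.10]
    estimated = item_rest_discrimination(synthetic_responses)
    print("estimated discrimination:", estimated)
    print("Spearman vs reference:", spearman(estimated, reference))
```

A rank correlation suits this comparison because it only asks whether the estimated ordering of items matches the reference ordering, not whether the two sets of values share a scale.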