Brief · PulseAugur

RESEARCH · arXiv cs.CL English(EN) · 1d · [2 sources]

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

A new study posted to arXiv investigates whether large language models (LLMs) can estimate item discrimination, meaning how well an assessment item distinguishes between students of different proficiency levels, in reading comprehension assessment. The researchers evaluated 42 LLMs using two methods: direct prediction of discrimination values, and response-based calibration, in which LLM answers serve as synthetic student responses. The findings indicate that LLM estimates carry some non-random signal about item discrimination but do not yet reliably capture how items separate students by proficiency: the best-performing models reached a Spearman correlation of only 0.241.

IMPACT Current LLMs do not reliably estimate how well assessment items separate students of different proficiency levels, which limits their use in educational evaluation.
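
As a rough illustration of the response-based calibration idea described in the summary, the sketch below treats each row of a 0/1 matrix as one synthetic "student" (for example, one LLM's answers), computes a Classical Test Theory style discrimination index per item, and compares estimates against reference values with Spearman's rank correlation. This is a minimal sketch under stated assumptions, not the study's pipeline: the corrected item-total (point-biserial) index is one common CTT choice and may differ from the paper's measure, and all function names and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0  # correlation is undefined here; treated as "no signal"
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def item_discrimination(responses, item):
    """Corrected item-total correlation for one item.

    responses: list of rows, one per (synthetic) student, with 1 = correct
    and 0 = incorrect. The item's own score is removed from the total so
    it does not correlate with itself.
    """
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: 4 synthetic students, 3 items.
toy_responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
estimated = [item_discrimination(toy_responses, i) for i in range(3)]
```

In this framing, `spearman(estimated, reference)` would play the role of the agreement score reported in the summary, where `reference` stands for discrimination values obtained from real student data.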

Hugging Face
LLMs
arXiv
Large Language Models
DagsHub
Spearman's rank correlation coefficient
Item discrimination
Reading Comprehension Assessment through Retelling: Performance Profiles of Children with Dyslexia and Language-Based Learning Disability
Classical Test Theory (CTT)