PulseAugur

LLMs struggle to measure student proficiency differences in assessments

A new study published on arXiv investigates whether large language models (LLMs) can estimate item discrimination in educational assessments. The researchers evaluated 42 LLMs using two methods: direct prediction of discrimination values, and response-based calibration, which treats LLM answers as synthetic student responses. The findings indicate that LLMs pick up some non-random signal related to item discrimination but do not yet reliably capture how assessment items distinguish between students of different proficiency levels: the best-performing models reach a Spearman correlation of only 0.241.
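To make the response-based idea concrete, here is a minimal sketch of how discrimination could be estimated from a matrix of right/wrong answers and then compared against reference values with a Spearman rank correlation. It is not the paper's procedure, which this summary does not describe. It assumes a classical-test-theory stand-in (the corrected item-total correlation), and every function name and all toy data below are hypothetical.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def item_discrimination(responses):
    """Corrected item-total correlation for each item.

    responses[s][i] is 1 if respondent s answered item i correctly, else 0.
    Each item is correlated with the total score on the *other* items.
    """
    n_items = len(responses[0])
    scores = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]
        scores.append(pearson(item, rest))
    return scores


if __name__ == "__main__":
    # Made-up toy data: rows are synthetic respondents (e.g. LLM answers),
    # columns are items. These are NOT values from the study.
    synthetic = [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    # Made-up reference discrimination values for the same four items.
    reference = [0.9, 1.4, 1.1, 0.6]
    estimated = item_discrimination(synthetic)
    print("estimated:", estimated)
    print("spearman vs reference:", spearman(estimated, reference))
```

In this framing, a Spearman value near 1 would mean the synthetic responses rank items by discrimination in the same order as the reference, while a value such as the reported 0.241 indicates only weak agreement in that ordering.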

IMPACT LLMs cannot yet reliably estimate how well assessment items separate students by proficiency, which limits their use for item analysis in educational evaluation.

RANK_REASON The cluster contains a research paper published on arXiv detailing findings about LLM capabilities in educational assessment.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.CL TIER_1 English(EN) · Han Chen, Ming Li, Chenguang Wang, Yijun Liang, Dawei Zhou, Hong Jiao, Tianyi Zhou

    LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

    arXiv:2606.18709v1. Abstract: Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While various exi…

  2. arXiv cs.CL TIER_1 English(EN) · Tianyi Zhou

    LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

    Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While various existing works have explored whether large language…