A new study published on arXiv investigates whether large language models (LLMs) can estimate item discrimination in educational assessments. Researchers evaluated 42 LLMs using two methods: direct prediction of discrimination values, and response-based calibration that treats LLM answers as synthetic student responses. The findings indicate that LLMs carry some non-random signal about item discrimination but do not yet reliably capture how assessment items distinguish between students of different proficiency levels: the best-performing models reached a Spearman correlation of only 0.241.
IMPACT LLMs cannot yet reliably estimate how well assessment items separate students of different proficiency levels, which leaves a gap in their use for item analysis in educational evaluation.
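To make the response-based approach concrete, here is a minimal Python sketch of the general idea, not the study's actual pipeline. It assumes binary (correct/incorrect) scoring, uses the item-rest correlation as the Classical Test Theory discrimination index (one common choice; the summary does not say which index the authors used), and compares the result to reference values with Spearman's rank correlation. The function names, the toy response matrix, and the reference values are all hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns NaN when either input has zero variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def item_rest_discrimination(responses):
    """Correlate each item's 0/1 scores with the total score on the other items.

    `responses` is a list of rows, one row per respondent (here, per LLM).
    """
    totals = [sum(row) for row in responses]
    result = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        rest = [t - s for t, s in zip(totals, item)]
        result.append(pearson(item, rest))
    return result


# Toy data (made up): 6 synthetic respondents x 4 items, 1 = correct.
synthetic_responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
# Hypothetical discrimination values estimated from real student data.
reference_discrimination = [0.45, 0.30, 0.60, 0.10]

estimated = item_rest_discrimination(synthetic_responses)
rho = spearman(estimated, reference_discrimination)
print("estimated discrimination:", [round(d, 3) for d in estimated])
print("Spearman rho vs. reference:", round(rho, 3))
```

Spearman's coefficient only compares the ordering of items, so a value near 0.241 would mean the synthetic responses rank items by discrimination only slightly better than chance.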
- arXiv
- Classical Test Theory (CTT)
- DagsHub
- Hugging Face
- Item discrimination
- Large Language Models
- LLMs
- Reading Comprehension Assessment through Retelling: Performance Profiles of Children with Dyslexia and Language-Based Learning Disability
- Spearman's rank correlation coefficient
AI-generated summary · Google Gemini · from 2 sources.