A new study published on arXiv investigates the effectiveness of using large language models (LLMs) as automated graders for physics assessments. The research found that LLM performance is highly dependent on the specific task, with models showing strong agreement with human graders on structured questions and code-based plot elements. However, LLMs struggled with essay grading, awarding harsher and more variable marks than human evaluators, and their ability to rank responses by quality remained low even when they were given mark schemes. The study concludes that LLM validity in assessment hinges more on the task's structure and the reliability of human benchmarks than on the raw capability of the models themselves.
IMPACT LLM grading effectiveness is task-dependent, suggesting careful implementation is needed for educational applications.
RANK_REASON Research paper published on arXiv detailing LLM performance in educational assessment.
AI-generated summary · Google Gemini · from 1 source.