PulseAugur

LLM grading inconsistent, risks unfairness in software engineering courses

A new preprint posted to arXiv examines the use of large language models (LLMs) to grade research reading reports in graduate-level software engineering courses. The researchers found that while LLMs such as Grok and GPT can reduce educator workload, their grading is significantly inconsistent, both within a single model and between models. The study also reports that a model's grading standards can drift away from human expert scores as its interaction history accumulates, potentially introducing systemic unfairness.

IMPACT Highlights the need for careful implementation of LLMs in education to ensure fairness and consistency in grading.



AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English (EN) · Qilin Zhou, Zhuo Wang, Yue Li, W. K. Chan

    Impacts of Histories and Models on LLM Grading: A Study in Advanced Software Engineering Courses

    arXiv:2606.08400v1. Abstract: Graduate-level research reading report assessment creates a substantial labor burden for educators. While large language models (LLMs) hold great potential for automating academic grading, their reliability for this specialized ta…

  2. arXiv cs.AI TIER_1 English (EN) · W. K. Chan

    Impacts of Histories and Models on LLM Grading: A Study in Advanced Software Engineering Courses

    Graduate-level research reading report assessment creates a substantial labor burden for educators. While large language models (LLMs) hold great potential for automating academic grading, their reliability for this specialized task remains understudied, particularly regarding gr…