A new study published on arXiv explores the use of large language models (LLMs) for grading graduate-level software engineering assignments. The researchers found that while LLMs such as Grok and GPT can reduce educator workload, their grading is significantly inconsistent, both within a single model and between models. The study also highlights that a model's grading standards can drift away from human expert scores as interaction history accumulates, potentially introducing systemic unfairness.
IMPACT Highlights the need for careful implementation of LLMs in education to ensure fairness and consistency in grading.
RANK_REASON The cluster contains an academic paper detailing research findings on LLM capabilities.
AI-generated summary · Google Gemini · from 2 sources.