Researchers have developed GradeLegal, a system that uses large language models to automate the grading of German legal exam solutions. The study evaluated 27 LLMs under several prompting strategies and found that reasoning-oriented models can agree closely with expert graders in public law, reaching a quadratic weighted kappa of 0.91. Agreement in criminal law was lower, suggesting that grading in that area is a harder task. Ensembling multiple models improved grading accuracy further, offering a potential alternative to top-tier proprietary models.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Automated grading systems could streamline feedback for legal students and reduce bottlenecks for educators.
RANK_REASON The cluster contains an academic paper presenting a new methodology and evaluation of LLMs for a specific task.
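
The agreement metric cited in the summary, quadratic weighted kappa, can be illustrated with a short sketch. This is the standard textbook definition of the metric, not code from the paper. The 0 to 18 point scale, the function names, and the median-based ensemble are assumptions made for illustration; the summary does not say how GradeLegal combines models.

```python
from collections import Counter
from statistics import median_low


def quadratic_weighted_kappa(grades_a, grades_b, min_grade=0, max_grade=18):
    """Agreement between two graders on an ordinal scale.

    The default 0..18 range is an assumption for illustration.
    Returns 1.0 for perfect agreement, about 0 for chance-level agreement.
    """
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("need two non-empty grade lists of equal length")
    n_levels = max_grade - min_grade + 1
    if n_levels < 2:
        raise ValueError("scale must have at least two grade levels")

    total = len(grades_a)
    observed = Counter(zip(grades_a, grades_b))  # joint grade counts
    hist_a = Counter(grades_a)                   # marginal counts, grader A
    hist_b = Counter(grades_b)                   # marginal counts, grader B

    disagreement = 0.0  # weighted observed disagreement
    by_chance = 0.0     # weighted disagreement expected by chance
    for i in range(min_grade, max_grade + 1):
        for j in range(min_grade, max_grade + 1):
            # Quadratic penalty: large grade gaps cost much more than small ones.
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            disagreement += weight * observed[(i, j)] / total
            by_chance += weight * hist_a[i] * hist_b[j] / total ** 2

    if by_chance == 0.0:
        # Both graders gave one identical grade to every exam; the metric is
        # undefined here, and this sketch treats it as perfect agreement.
        return 1.0
    return 1.0 - disagreement / by_chance


def ensemble_grade(model_grades):
    """Hypothetical ensemble: the (lower) median of several models' grades."""
    return median_low(model_grades)
```

Under these assumptions, `ensemble_grade` would be applied per exam to the grades from several models, and the resulting list compared against the expert grades with `quadratic_weighted_kappa`. The median is used here only because it is robust to a single outlying model; a mean or a weighted vote would be equally plausible choices.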