PulseAugur

LLMs match and exceed human examiner agreement on UK GCSE exams

A new dataset of 32,534 double-marked real student responses to UK GCSE mock exams has been introduced, covering 328 questions across five subjects, including handwritten work. Researchers found that current large language models align closely with examiner consensus, even surpassing the agreement rate between human examiners on subjective tasks like English essays and complex handwritten mathematics papers. The study suggests that LLMs offer cost-effective automated marking solutions, with agreement levels remaining high regardless of model size.

IMPACT LLMs demonstrate strong potential for cost-effective automated grading in educational settings, even for subjective and complex tasks.

RANK_REASON The cluster contains an academic paper introducing a new dataset and an evaluation of LLM performance on a specific benchmark.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.CL TIER_1 English (EN) · Malachy Fox, Kavi Samra, Paul Jung

    LLM Performance on a Real, Double-Marked GCSE Benchmark

    arXiv:2606.24973v1. Abstract: We introduce a dataset of 32,534 double-marked real student responses to GCSE mock exams (GCSEs are the UK's national exams, taken at age ~16), spanning 328 questions across five subjects and including handwritten work. We test whet…