LLM Performance on a Real, Double-Marked GCSE Benchmark

LLMs match and exceed human examiner agreement on UK GCSE exams

By the PulseAugur editorial team · [1 source] · 2026-06-25 04:00

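Researchers have introduced a new dataset of 32,534 double-marked real student responses to UK GCSE mock exams, spanning 328 questions across five subjects and including handwritten work. They found that current large language models come very close to the examiners' consensus, and on subjective tasks (such as English essays and complex handwritten maths papers) even exceed the agreement between human examiners. The study indicates that LLMs offer a cost-effective route to automated marking, and that agreement stays high regardless of model size.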
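To make the comparison concrete, here is a minimal sketch of how LLM-to-examiner agreement could be set against examiner-to-examiner agreement on double-marked responses. The source does not say which agreement statistic the paper uses, so the mean absolute mark difference below is an assumption, and the function names and toy marks are invented purely for illustration.

```python
from statistics import mean


def mean_abs_diff(marks_a, marks_b):
    """Mean absolute difference between two equal-length lists of marks."""
    if len(marks_a) != len(marks_b):
        raise ValueError("mark lists must have the same length")
    return mean(abs(a - b) for a, b in zip(marks_a, marks_b))


def compare_agreement(marker_1, marker_2, llm):
    """Return (human-human gap, average LLM-human gap); lower means closer agreement.

    Assumption: each response carries one mark from each of two examiners
    plus one LLM mark. The LLM is compared with each examiner individually
    so both gaps are measured against single markers.
    """
    human_gap = mean_abs_diff(marker_1, marker_2)
    llm_gap = mean([mean_abs_diff(llm, marker_1), mean_abs_diff(llm, marker_2)])
    return human_gap, llm_gap


# Toy marks for five responses (hypothetical, not taken from the dataset).
marker_1 = [4, 6, 3, 8, 5]
marker_2 = [5, 6, 2, 6, 5]
llm = [4, 6, 3, 7, 5]

human_gap, llm_gap = compare_agreement(marker_1, marker_2, llm)
print(f"human-human gap: {human_gap:.2f}, LLM-human gap: {llm_gap:.2f}")
```

In this toy case the LLM's average gap to the examiners is smaller than the examiners' gap to each other, which is the pattern the summary describes for subjective tasks. A chance-corrected statistic could be substituted without changing the structure of the comparison.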

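Impact: LLMs show strong potential for cost-effective automated marking in educational settings, even for subjective and complex tasks.

Ranking rationale: This cluster contains an academic paper that introduces a new dataset and evaluates LLMs on a specific benchmark.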


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CL · Malachy Fox, Kavi Samra, Paul Jung · 2026-06-25 04:00

LLM Performance on a Real, Double-Marked GCSE Benchmark

arXiv:2606.24973v1 · Abstract: We introduce a dataset of 32,534 double-marked real student responses to GCSE mock exams (GCSEs are the UK's national exams, taken at age ~16), spanning 328 questions across five subjects and including handwritten work. We test whet…

