LLMs as Teaching Assistants for Mathematics Exam Grading: Reliability, and Practical Usability

LLMs Evaluated as Grading Assistants for Mathematics Exams

By the PulseAugur editorial team · [1 source] · 2026-07-03 04:00

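A new research paper evaluates how well six large language models (LLMs) work as grading assistants for undergraduate mathematics exams. The study compares Gemini 3.1 Pro Extended, Gemini 3.5 Flash, ChatGPT 5.5 Pro Extended, ChatGPT 5.5 Thinking, Claude Pro Opus 4.7, and Claude Sonnet 4.6. The researchers found that a more lenient partial-credit prompting strategy improved grading accuracy for every model evaluated: ChatGPT 5.5 Thinking had the lowest mean question-level error, while Gemini 3.1 Pro Extended had the lowest total-score error. Under the stricter prompt, however, Gemini 3.1 Pro Extended showed the strongest correlation on total scores, which suggests that score calibration and rank preservation are distinct grading objectives.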
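The distinction between score calibration and rank preservation can be illustrated with a small sketch. The scores below are invented toy numbers, not data from the paper, and the choice of Spearman rank correlation is an assumption, since the summary does not say which correlation measure the study used. The hypothetical `strict_grader` is consistently 15 points too low but orders students exactly as the human grader does; the hypothetical `lenient_grader` is closer on average but swaps neighboring students.

```python
from statistics import mean


def mean_absolute_error(human, model):
    """Average absolute gap between human and model scores (calibration)."""
    return mean(abs(h - m) for h, m in zip(human, model))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented total scores for five students (illustration only).
human = [40, 55, 70, 85, 95]
strict_grader = [25, 40, 55, 70, 80]    # always 15 low, same ordering
lenient_grader = [52, 48, 78, 74, 96]   # closer on average, two swaps

for name, scores in [("strict", strict_grader), ("lenient", lenient_grader)]:
    mae = mean_absolute_error(human, scores)
    rho = spearman(human, scores)
    print(f"{name}: MAE={mae:.2f}, Spearman={rho:.2f}")
```

In this toy case the strict grader has the larger error (15.0 versus 7.8) yet a perfect rank correlation (1.0 versus 0.8), so a grader can win on one objective and lose on the other.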

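Impact: LLMs show potential in academic settings as tools for automating complex, open-ended assessments and making their grading more consistent.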

Ranking rationale: A research paper evaluating LLM performance on a specific task.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI · Aastha Sapkota, M. G. Sarwar Murshed · 2026-07-03 04:00

LLMs as Teaching Assistants for Mathematics Exam Grading: Reliability, and Practical Usability

arXiv:2607.01247v1 Announce Type: cross Abstract: Open-ended mathematics exams are valuable because they assess reasoning, proof construction, algorithmic thinking, and communication of intermediate steps. They are also difficult to grade at scale because instructors must apply p…
