LLMs evaluated as math exam grading assistants

By PulseAugur Editorial · [1 source] · 2026-07-03 04:00

A new research paper evaluates six large language models (LLMs) as assistants for grading undergraduate mathematics exams. The study compared Gemini 3.1 Pro Extended, Gemini 3.5 Flash, ChatGPT 5.5 Pro Extended, ChatGPT 5.5 Thinking, Claude Pro Opus 4.7, and Claude Sonnet 4.6. The researchers found that a more liberal partial-credit prompting policy improved grading accuracy for every model evaluated. Under that policy, ChatGPT 5.5 Thinking showed the lowest average question-level error and Gemini 3.1 Pro Extended achieved the lowest total-score error. However, the strongest correlation in total scores came from Gemini 3.1 Pro Extended under a stricter prompt, which indicates that point calibration (assigning the right number of points) and rank preservation (ordering students correctly) are distinct grading goals.
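The difference between point calibration and rank preservation can be made concrete with a small sketch. The summary does not give the paper's exact metric definitions, so this example assumes mean absolute error for the two error measures and Pearson correlation for total scores. The function names and the scores are hypothetical toy values chosen for illustration, not data from the study.

```python
from statistics import mean

def question_level_mae(human, model):
    """Mean absolute error over every (student, question) score pair."""
    diffs = [abs(h - m) for hs, ms in zip(human, model) for h, m in zip(hs, ms)]
    return mean(diffs)

def total_score_mae(human, model):
    """Mean absolute error between per-student exam totals."""
    return mean(abs(sum(hs) - sum(ms)) for hs, ms in zip(human, model))

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical toy scores: one inner list per student, one entry per question.
human = [[10, 8], [6, 4], [2, 2]]    # instructor grades
strict = [[8, 6], [4, 2], [0, 0]]    # every question graded 2 points low
liberal = [[10, 8], [6, 2], [4, 2]]  # closer on points, less consistent

human_totals = [sum(s) for s in human]
for name, graded in (("strict", strict), ("liberal", liberal)):
    totals = [sum(s) for s in graded]
    print(
        name,
        round(question_level_mae(human, graded), 3),
        round(total_score_mae(human, graded), 3),
        round(pearson(human_totals, totals), 3),
    )
```

In this toy data the strict grader is uniformly harsh, so its point errors are larger while its totals track the instructor's ordering exactly. The liberal grader is closer on points but correlates less well with the instructor's totals. That is the kind of split the summary describes, where the best-calibrated configuration and the best rank-preserving configuration are not the same.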

IMPACT LLMs show potential as tools for automating and improving the consistency of grading complex, open-ended assessments in academic settings.

RANK_REASON Research paper evaluating LLM performance on a specific task.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Aastha Sapkota, M. G. Sarwar Murshed · 2026-07-03 04:00

LLMs as Teaching Assistants for Mathematics Exam Grading: Reliability, and Practical Usability

arXiv:2607.01247v1 Announce Type: cross Abstract: Open-ended mathematics exams are valuable because they assess reasoning, proof construction, algorithmic thinking, and communication of intermediate steps. They are also difficult to grade at scale because instructors must apply p…

