A new research paper evaluates six large language models (LLMs) as assistants for grading undergraduate mathematics exams. The study compared Gemini 3.1 Pro Extended, Gemini 3.5 Flash, ChatGPT 5.5 Pro Extended, ChatGPT 5.5 Thinking, Claude Pro Opus 4.7, and Claude Sonnet 4.6. Researchers found that a more liberal partial-credit prompting policy improved grading accuracy for all evaluated models. Under that policy, ChatGPT 5.5 Thinking showed the lowest average question-level error and Gemini 3.1 Pro Extended achieved the lowest total-score error. However, Gemini 3.1 Pro Extended under a stricter prompt exhibited the strongest correlation with reference total scores, indicating that point calibration (assigning the right number of points) and rank preservation (ordering students correctly) are distinct grading goals.
AI IMPACT LLMs show potential as tools for automating and improving the consistency of grading complex, open-ended assessments in academic settings.
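The distinction between point calibration and rank preservation can be illustrated with a small sketch. It assumes total-score error is measured as mean absolute error and rank preservation as Pearson correlation; the summary does not state which metrics the paper actually uses. All scores below are invented for illustration and are not data from the study.

```python
from math import sqrt
from statistics import mean


def mean_abs_error(pred, ref):
    """Average absolute gap between predicted and reference scores."""
    return mean(abs(p - r) for p, r in zip(pred, ref))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical total scores for five exams (not from the paper).
human = [50, 60, 70, 80, 90]    # reference grades
strict = [40, 50, 60, 70, 80]   # always 10 points too low, ordering intact
lenient = [55, 55, 75, 75, 95]  # closer in points, but ties blur the ordering

for name, scores in (("strict", strict), ("lenient", lenient)):
    print(name, mean_abs_error(scores, human), round(pearson(scores, human), 3))
```

In this toy data the strict grader is off by 10 points on every exam yet correlates almost perfectly with the reference, while the lenient grader is off by only 5 points on average but correlates less strongly. A model can therefore lead on one measure and trail on the other.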
RANK_REASON Research paper evaluating LLM performance on a specific task.
- ChatGPT 5.5 Pro Extended
- ChatGPT 5.5 Thinking
- Claude Pro Opus 4.7
- Claude Sonnet 4.6
- Gemini 3.1 Pro Extended
- Gemini 3.5 Flash
- M. G. Sarwar Murshed
AI-generated summary · Google Gemini · from 1 source.