Brief · PulseAugur

TOOL · arXiv cs.AI · English (EN) · 1w

GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

GTBench is a new benchmark that evaluates large language models as mathematical research assistants in graph theory. It comprises 63 problems categorized by difficulty, ranging from undergraduate concepts to graduate-level proof construction. In testing, GPT-5 performed strongly across all levels, while other models such as Llama 3.3 degraded significantly, particularly on complex proof tasks.

IMPACT Provides a new evaluation benchmark for LLM reasoning in advanced mathematics and highlights performance disparities between models.

Llama 3.3 70B
GPT-5
Claude Sonnet 4.6
Gemini 2.5 Flash-Lite
Mistral Large 3
Diestel's Graph Theory
GTBench