A new benchmark called GTBench has been developed to evaluate the capabilities of large language models as mathematical research assistants, specifically in the field of graph theory. The benchmark features 63 problems categorized by difficulty, ranging from undergraduate concepts to graduate-level proof construction. When tested, GPT-5 demonstrated strong performance across all levels, while other models like Llama 3.3 showed significant degradation, particularly on complex proof tasks.
AI IMPACT Establishes a new evaluation standard for LLM reasoning in advanced mathematics, highlighting performance disparities.
RANK_REASON The cluster contains an academic paper introducing a new benchmark for evaluating LLMs.
- Claude Sonnet 4.6
- Diestel's Graph Theory
- Gemini 2.5 Flash-Lite
- GPT-5
- GTBench
- Llama 3.3 70B
- Mistral Large 3
AI-generated summary · Google Gemini · from 1 source.