PulseAugur

New benchmarks assess LLM math reasoning, proof verification

Researchers have introduced new benchmarks and evaluation methods for assessing the mathematical reasoning of large language models. ComBench targets Olympiad-level combinatorics, distinguishing proof reasoning from constructive realization, and found that even top models struggle with these tasks. TheoremBench evaluates LLMs on theorem proving in formal mathematics using Lean 4, highlighting the need for benchmarks that go beyond competition-style problems and measure performance on longer, dependency-rich mathematical developments. A third line of work applies strict step-level verification to research-level proofs, checking each deduction step individually to address LLM unreliability.

IMPACT These benchmarks and verification methods could drive progress in LLM mathematical reasoning and formal proof capabilities.

RANK_REASON Multiple research papers introducing new benchmarks and evaluation methodologies for LLMs in mathematical reasoning.

AI-generated summary · Google Gemini · from 7 sources.

COVERAGE [7]

  1. arXiv cs.AI TIER_1 English(EN) · Shunkai Zhang, Haoran Zhang, Yun Luo, Qianjia Cheng, Haodi Lei, Yizhuo Li, Runzhe Zhan, Zhilin Wang, Bangjie Xu, Yucheng Su, Xinmiao Han, Xiaoye Qu, Dongrui Liu, Zhouchen Lin, Yu Qiao, Ning Ding, Yafu Li, Yu Cheng ·

    ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

    arXiv:2606.10479v1 Announce Type: new Abstract: Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier model…

  2. arXiv cs.AI TIER_1 English(EN) · Yifeng Sun ·

    Evaluating Research-Level Math Proofs via Strict Step-Level Verification

    arXiv:2606.10799v1 Announce Type: new Abstract: Large Language Models (LLMs) struggle to rigorously verify complex mathematical proofs. Standard global evaluation approaches suffer from "context poisoning," in which superficially plausible statements mask subtle logical flaws, le…

  3. arXiv cs.AI TIER_1 English(EN) · Yifeng Sun ·

    Evaluating Research-Level Math Proofs via Strict Step-Level Verification

    Large Language Models (LLMs) struggle to rigorously verify complex mathematical proofs. Standard global evaluation approaches suffer from "context poisoning," in which superficially plausible statements mask subtle logical flaws, leading to hallucination or over-skepticism. To ad…

  4. arXiv cs.AI TIER_1 English(EN) · George Tsoukalas, Anton Kovsharov, Sergey Shirobokov, Anja Surina, Moritz Firsching, Gergely Bérczi, Francisco J. R. Ruiz, Arun Suggala, Adam Zsolt Wagner, Eric Wieser, Lei Yu, Aja Huang, Miklós Z. Horváth, Andrew Ferraiuolo, Henryk Michalewski, Ed… ·

    Advancing Mathematics Research with AI-Driven Formal Proof Search

    arXiv:2605.22763v2 Announce Type: replace Abstract: Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability limits their utility in mathematics research. A mitigation is using LLMs to generate formal proofs in languages like Lean. We per…

  5. arXiv cs.AI TIER_1 English(EN) · QuocViet Pham, Elvir Karimov, Andrey Galichin, Ivan Oseledets ·

    TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

    arXiv:2606.09450v1 Announce Type: new Abstract: LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-…

  6. Hugging Face Daily Papers TIER_1 English(EN) ·

    TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

    LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce The…

  7. arXiv cs.AI TIER_1 English(EN) · Ivan Oseledets ·

    TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

    LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce The…