New benchmarks assess LLM math reasoning, proof verification

By PulseAugur Editorial · [13 sources] · 2026-05-29 00:00

Researchers have introduced new benchmarks and evaluation methods for assessing the mathematical reasoning capabilities of large language models. ComBench focuses on Olympiad-level combinatorics, distinguishing proof reasoning from constructive realization, and found that even top models struggle with these tasks. TheoremBench evaluates LLMs on theorem proving in formal mathematics using Lean 4, highlighting the need for benchmarks that go beyond competition-style problems and measure performance on longer, dependency-rich mathematical developments. A third paper proposes strict step-level verification of research-level proofs, which aims to address LLM unreliability by checking each deduction step individually.
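To make the idea of step-level checking concrete, here is a minimal Python sketch. It is an illustration only, not the method from the paper, whose details are not given in the sources above. The names `Step`, `verify_stepwise`, and `toy_checker` are hypothetical, and the toy checker (which only validates integer sums) stands in for whatever LLM or formal checker a real system would use. The point it shows is that each step is judged against only the earlier steps it cites, so a plausible-sounding surrounding proof cannot hide a faulty deduction.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    claim: str            # the statement made at this step
    premises: List[int]   # indices of earlier steps this step relies on


def verify_stepwise(
    steps: List[Step],
    check_step: Callable[[List[str], str], bool],
) -> Optional[int]:
    """Return the index of the first step that fails, or None if all pass."""
    for i, step in enumerate(steps):
        # A step may only cite steps that come strictly before it.
        if any(p < 0 or p >= i for p in step.premises):
            return i
        # The checker sees only the cited premises, not the whole proof.
        context = [steps[p].claim for p in step.premises]
        if not check_step(context, step.claim):
            return i
    return None


def toy_checker(context: List[str], claim: str) -> bool:
    """Hypothetical stand-in checker: accepts 'a + b = c' when the sums match."""
    lhs, rhs = claim.split("=")

    def total(side: str) -> int:
        return sum(int(term) for term in side.split("+"))

    return total(lhs) == total(rhs)


proof = [
    Step("2 + 3 = 5", []),
    Step("5 + 5 = 10", [0]),
    Step("10 + 1 = 12", [1]),   # the flawed deduction
    Step("12 + 0 = 12", [2]),   # locally valid, but built on the flaw
]

first_bad = verify_stepwise(proof, toy_checker)
print(first_bad)
```

Stopping at the first failing step is a design choice made for this sketch: later steps that depend on a rejected one are not reported separately, since their premises are already unsound.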

IMPACT These benchmarks and verification methods give researchers more demanding tests of LLM mathematical reasoning and formal proof capabilities, which could help guide further progress.

RANK_REASON Multiple research papers introducing new benchmarks and evaluation methodologies for LLMs in mathematical reasoning.


AI-generated summary · Google Gemini · from 13 sources.

COVERAGE [13]

arXiv cs.AI TIER_1 English(EN) · Jiacheng Chen, Xinyu Zhang, Shunkai Zhang, Yanmohan Wang, Lin Li, Tiancheng Qin, Qin Wang, Zhengmao Zhu, Tianle Li, Jingyang Li, Zehan Li, Binyang Jiang, Jin Zhu, Han Ding, Fei Yu, Chenyu Du, Zijian Song, Jiayuan Song, Zhi Zhang, Yunan Huang, Weiyu Cheng… · 2026-06-12 04:00

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

arXiv:2606.13473v1 Announce Type: cross Abstract: We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities -- proof generation, proof verification, and …
arXiv cs.AI TIER_1 English(EN) · Joshua Ong Jun Leang, Zheng Zhao, Mihaela Cătălina Stoian, Qiyuan Xu, Haonan Li, Wenda Li, Shay B. Cohen, Eleonora Giunchiglia · 2026-06-12 04:00

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

arXiv:2606.12594v1 Announce Type: new Abstract: Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long reasoning traces of formal proof search, making both supervised f…
arXiv cs.AI TIER_1 English(EN) · Yu Cheng · 2026-06-11 15:27

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities -- proof generation, proof verification, and critique-conditioned proof repair -- using a defen…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 00:00

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

MaxProof is a test-time scaling framework that enhances mathematical proof generation by combining multiple proof-oriented capabilities and using population-level search with tournament selection to achieve competitive performance on high-level mathematical competitions.
arXiv cs.AI TIER_1 English(EN) · Yifeng Sun · 2026-06-10 04:00

Evaluating Research-Level Math Proofs via Strict Step-Level Verification

arXiv:2606.10799v1 Announce Type: new Abstract: Large Language Models (LLMs) struggle to rigorously verify complex mathematical proofs. Standard global evaluation approaches suffer from "context poisoning," in which superficially plausible statements mask subtle logical flaws, le…
arXiv cs.AI TIER_1 English(EN) · Shunkai Zhang, Haoran Zhang, Yun Luo, Qianjia Cheng, Haodi Lei, Yizhuo Li, Runzhe Zhan, Zhilin Wang, Bangjie Xu, Yucheng Su, Xinmiao Han, Xiaoye Qu, Dongrui Liu, Zhouchen Lin, Yu Qiao, Ning Ding, Yafu Li, Yu Cheng · 2026-06-10 04:00

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

arXiv:2606.10479v1 Announce Type: new Abstract: Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier model…
arXiv cs.AI TIER_1 English(EN) · Yifeng Sun · 2026-06-09 12:46

Evaluating Research-Level Math Proofs via Strict Step-Level Verification

Large Language Models (LLMs) struggle to rigorously verify complex mathematical proofs. Standard global evaluation approaches suffer from "context poisoning," in which superficially plausible statements mask subtle logical flaws, leading to hallucination or over-skepticism. To ad…
arXiv cs.AI TIER_1 English(EN) · George Tsoukalas, Anton Kovsharov, Sergey Shirobokov, Anja Surina, Moritz Firsching, Gergely Bérczi, Francisco J. R. Ruiz, Arun Suggala, Adam Zsolt Wagner, Eric Wieser, Lei Yu, Aja Huang, Miklós Z. Horváth, Andrew Ferraiuolo, Henryk Michalewski, Ed… · 2026-06-09 04:00

Advancing Mathematics Research with AI-Driven Formal Proof Search

arXiv:2605.22763v2 Announce Type: replace Abstract: Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability limits their utility in mathematics research. A mitigation is using LLMs to generate formal proofs in languages like Lean. We per…
arXiv cs.AI TIER_1 English(EN) · QuocViet Pham, Elvir Karimov, Andrey Galichin, Ivan Oseledets · 2026-06-09 04:00

TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

arXiv:2606.09450v1 Announce Type: new Abstract: LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-09 00:00

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

A new benchmark called ComBench is introduced to evaluate large language models' combinatorial reasoning abilities through Olympiad-level problems that test both proof construction and explicit mathematical constructions.
arXiv cs.AI TIER_1 English(EN) · Ivan Oseledets · 2026-06-08 12:57

TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce The…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-08 12:57

TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce The…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-29 00:00

Distilling LLM Feedback for Lean Theorem Proving

Feedback Distillation improves post-training of reasoning models by using self-distillation with token-level supervision and privileged feedback from language models, offering better diversity and complementary benefits when combined with GRPO.
