Flaws Found in Lean Theorem Proving Benchmarks and RL Model Inference

By PulseAugur Editorial · [2 sources] · 2026-06-30 04:00

Researchers have identified significant flaws in benchmarks for Lean theorem proving, uncovering thousands of issues including counterexamples and vacuous theorems. A separate diagnostic study finds that RL-trained Lean theorem provers suffer from inference-time mode collapse: on miniF2F-test with DeepSeek-Prover-V1.5-RL, doubling the i.i.d. sampling budget from k=32 to k=64 yields no additional solved theorems (42/244). Interventions such as structured tactic skeletons can nevertheless improve performance, which suggests that inference-time diversity is a crucial, orthogonal axis for enhancing RL-trained provers.
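
To make "vacuous theorem" concrete, here is a minimal Lean 4 sketch. It is a hypothetical illustration, not an item from either paper or from any benchmark. The hypothesis `h` can never hold for a natural number, so the kernel accepts a proof of an arbitrary conclusion even though the statement carries no mathematical content.

```lean
-- Hypothetical example, not taken from any benchmark.
-- The hypothesis `h : n < 0` is unsatisfiable for natural numbers,
-- so the statement is vacuously true and the proof type-checks.
theorem vacuous_example (n : Nat) (h : n < 0) : n = 42 :=
  absurd h (Nat.not_lt_zero n)
```

A benchmark that counts such an item as solved would be rewarding the prover for spotting a contradictory hypothesis, not for the mathematics the statement was presumably meant to capture.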

IMPACT: Highlights critical issues in evaluating AI for formal reasoning that affect the reliability of benchmarks and the development of theorem-proving agents.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.AI TIER_1 English(EN) · Pawan Sasanka Ammanamanchi, Siddharth Bhat, Stella Biderman · 2026-06-30 04:00

Faults in Our Formal Benchmarking: Dataset Defects and Evaluation Failures in Lean Theorem Proving

arXiv:2606.29493v1 · Abstract: Benchmarks for LLM-assisted theorem proving in Lean are often treated as intrinsically reliable because every solved instance comes with a machine-checked proof. However, the kernel only checks that a proof establishes a forma…

arXiv cs.AI TIER_1 English(EN) · Zachary Burton · 2026-06-30 04:00

Inference-Time Diversity in RL-Trained Lean Theorem Provers: A Diagnostic Study

arXiv:2601.16172v3 · Abstract: RL-trained Lean theorem provers mode-collapse at inference time: on miniF2F-test with DeepSeek-Prover-V1.5-RL, doubling the i.i.d. sampling budget from k=32 to k=64 produces zero additional solved theorems (42/244 in bo…
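
The reported plateau can be read against the standard i.i.d. sampling formula: if a single sample proves a theorem with probability p, then at least one of k independent samples succeeds with probability 1 - (1 - p)^k. The Python sketch below compares two hypothetical distributions of p over 244 theorems. The per-theorem rates are illustrative assumptions, not the paper's data, chosen only so that the collapsed case matches the 42 solved theorems quoted in the abstract.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k i.i.d. samples succeeds."""
    return 1.0 - (1.0 - p) ** k


def expected_solved(probs: list[float], k: int) -> float:
    """Expected number of theorems solved with k samples per theorem."""
    return sum(pass_at_k(p, k) for p in probs)


# Hypothetical per-theorem success rates (assumptions for illustration only).
# Collapsed: 42 theorems are always solved, the other 202 never are.
collapsed = [1.0] * 42 + [0.0] * 202
# Diverse: every theorem has a small but nonzero chance per sample.
diverse = [0.02] * 244

for name, probs in [("collapsed", collapsed), ("diverse", diverse)]:
    at_32 = expected_solved(probs, 32)
    at_64 = expected_solved(probs, 64)
    print(f"{name}: k=32 -> {at_32:.1f}, k=64 -> {at_64:.1f}")
```

In the collapsed case the expected count is identical at k=32 and k=64, because extra samples only help theorems whose per-sample success probability is nonzero. In the diverse case the count keeps growing with k, which is the behaviour a larger sampling budget is normally expected to buy.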

