PulseAugur

Flaws Found in Lean Theorem Proving Benchmarks and RL Model Inference

Researchers have identified significant flaws in Lean theorem-proving benchmarks, uncovering thousands of dataset issues, including statements with counterexamples and vacuous theorems. A separate diagnostic study of RL-trained Lean theorem provers finds that these models mode-collapse at inference time: on miniF2F-test with DeepSeek-Prover-V1.5-RL, doubling the i.i.d. sampling budget from k=32 to k=64 yields zero additional solved theorems (42/244). However, interventions such as structured tactic skeletons can improve performance, suggesting that inference-time diversity is a crucial, orthogonal axis for improving RL-trained provers.
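
To make the notion of a vacuous theorem concrete, here is a minimal Lean 4 sketch. It is a hypothetical statement written for illustration, not one taken from any of the benchmarks discussed, and it assumes a recent Lean 4 toolchain in which the `omega` tactic is available. The two hypotheses contradict each other, so the kernel accepts a proof of a conclusion that says nothing useful.

```lean
-- Hypothetical example, not drawn from any benchmark.
-- h₁ and h₂ cannot both hold, so the statement is vacuously true:
-- any conclusion follows, including this arbitrary one.
theorem vacuous_example (n : Nat) (h₁ : n < 3) (h₂ : n > 5) : n = 1000 := by
  omega  -- closes the goal by deriving a contradiction from h₁ and h₂
```

A machine-checked proof of such a statement counts as "solved" in a naive evaluation even though no mathematical content was established.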

IMPACT Highlights critical issues in evaluating AI for formal reasoning, impacting the reliability of benchmarks and the development of theorem-proving agents.

RANK_REASON Two arXiv papers detailing issues with formal benchmarking of Lean theorem provers and diagnostic studies of RL-trained provers.


AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

  1. arXiv cs.LG TIER_1 English(EN) · Leni Aniva, Iori Oikawa, David Dill, Clark Barrett ·

    Nazrin: An Atomic Neural Proof Automation Tactic in Lean 4

    arXiv:2602.18767v3 Announce Type: replace-cross Abstract: In Machine-Assisted Theorem Proving, a theorem proving agent searches for a sequence of expressions and tactics that can prove a statement in a proof assistant. In this work, we introduce several novel concepts and capabil…

  2. arXiv cs.AI TIER_1 English(EN) · Pawan Sasanka Ammanamanchi, Siddharth Bhat, Stella Biderman ·

    Faults in Our Formal Benchmarking: Dataset Defects and Evaluation Failures in Lean Theorem Proving

    arXiv:2606.29493v1 Announce Type: new Abstract: Benchmarks for LLM-assisted theorem proving in Lean are often treated as intrinsically reliable because every solved instance comes with a machine-checked proof. However, the kernel only checks that a proof establishes a forma…

  3. arXiv cs.AI TIER_1 English(EN) · Zachary Burton ·

    Inference-Time Diversity in RL-Trained Lean Theorem Provers: A Diagnostic Study

    arXiv:2601.16172v3 Announce Type: replace Abstract: RL-trained Lean theorem provers mode-collapse at inference time: on miniF2F-test with DeepSeek-Prover-V1.5-RL, doubling the i.i.d. sampling budget from k=32 to k=64 produces zero additional solved theorems (42/244 in bo…
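
The budget comparison described in the third abstract (solved theorems at k=32 versus k=64) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation code: the function names, the `toy_outcomes` data, and the theorem identifiers are all hypothetical, and each theorem is assumed to come with an ordered list of booleans recording whether each sampled proof was accepted by Lean.

```python
from typing import Dict, List, Set


def solved_at_budget(outcomes: Dict[str, List[bool]], k: int) -> Set[str]:
    """Return theorems with at least one accepted proof among the first k samples."""
    return {name for name, results in outcomes.items() if any(results[:k])}


def marginal_gain(outcomes: Dict[str, List[bool]], k_small: int, k_large: int) -> int:
    """Count theorems solved at the larger budget but not at the smaller one."""
    return len(solved_at_budget(outcomes, k_large) - solved_at_budget(outcomes, k_small))


# Hypothetical toy data: 64 samples per theorem (True = proof accepted).
toy_outcomes = {
    "thm_a": [True] + [False] * 63,                # solved by the first sample
    "thm_b": [False] * 64,                         # never solved
    "thm_c": [False] * 40 + [True] + [False] * 23, # solved only by sample 41
}

print(sorted(solved_at_budget(toy_outcomes, 32)))
print(marginal_gain(toy_outcomes, 32, 64))
```

In this toy data the larger budget picks up one extra theorem. A marginal gain of zero across a whole benchmark, as the abstract reports for k=32 to k=64, is the signature the study calls inference-time mode collapse.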