Researchers have identified significant flaws in Lean theorem-proving benchmark datasets, uncovering thousands of issues, including counterexamples and vacuous theorems. A separate study of RL-trained Lean theorem provers finds that these models suffer from inference-time mode collapse, in which increasing the sampling budget yields no additional solved theorems. However, interventions such as structured tactic skeletons can improve performance, suggesting that inference-time diversity is a crucial, orthogonal axis for improving RL-trained provers.
IMPACT Highlights critical issues in evaluating AI for formal reasoning that affect the reliability of benchmarks and the development of theorem-proving agents.
RANK_REASON Two arXiv papers detailing issues with formal benchmarking of Lean theorem provers and diagnostic studies of RL-trained provers.
- arXiv
- CatalyzeX
- DagsHub
- DeepSeek-Prover-V1.5-RL
- DeepSeek-Prover-V2-7B
- Goedel-Prover
- Gotit.pub
- Hugging Face
- Lean
- miniF2F-test
- Pawan Sasanka Ammanamanchi
- ScienceCast
- Zachary F Burton
AI-generated summary · Google Gemini · from 2 sources.