Evaluation of LLMs for Mathematical Formalization in Lean
A new research paper evaluates how well various Large Language Models (LLMs) generate formal mathematical proofs with the Lean 4 theorem prover. The study reports pass@k and refine@k metrics on subsets of the miniF2F and miniCTX datasets. Gemini 3.1 Pro and Claude Opus 4.7 achieved the highest success rates, with Gemini reaching 92% on miniF2F and Opus reaching 86% on miniCTX. On cost-efficiency, NVIDIA Nemotron 3 Super and GPT-OSS 120B offered competitive accuracy at a low cost per proof.
AI IMPACT: This research highlights LLM capabilities in formal mathematics, potentially aiding theorem proving and mathematical research.
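
For readers unfamiliar with the pass@k metric named in the summary, here is a minimal sketch of one common way to estimate it: the unbiased estimator widely used in code-generation benchmarks, which computes the probability that at least one of k samples drawn from n attempts is correct. Whether the paper uses this exact estimator is an assumption, the function name pass_at_k is hypothetical, and refine@k is not covered here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total proof attempts sampled for the problem
    c: number of those attempts that the checker accepted
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no correct attempt,
    # subtracted from 1. comb() returns 0 when n - c < k, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level score is then the mean over problems, for example mean_pass_at_k([(4, 1), (4, 0)], k=2) for two problems with four attempts each.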