
LLMs evaluated for formal math proofs in Lean 4

A new research paper evaluates how well various large language models (LLMs) generate formal mathematical proofs in the Lean 4 theorem prover. The study reports pass@k and refine@k metrics on subsets of the miniF2F and miniCTX datasets. Gemini 3.1 Pro and Claude Opus 4.7 had the highest success rates: Gemini reached 92% on miniF2F and Opus reached 86% on miniCTX. For cost-efficiency, NVIDIA Nemotron 3 Super and GPT-OSS 120B offered competitive accuracy at a low cost per proof.
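The summary does not say how the paper computes pass@k, so the following is only a minimal sketch of one widely used definition: the unbiased estimator from code-generation evaluation, 1 - C(n-c, k) / C(n, k), where n proofs are sampled per theorem and c of them are accepted by the checker. The function names `pass_at_k` and `mean_pass_at_k` and the sample counts are hypothetical and are not taken from the paper. refine@k is not sketched because the source does not define it.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled proofs is correct.

    n: proofs sampled for one theorem (assumed setup, not from the paper)
    c: how many of those n proofs the checker accepted
    k: attempt budget being scored
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures exist, so every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over theorems, given (n, c) pairs per theorem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts for three theorems, 8 samples each.
    demo = [(8, 8), (8, 2), (8, 0)]
    for k in (1, 4, 8):
        print(f"pass@{k} = {mean_pass_at_k(demo, k):.3f}")
```

Under this definition, pass@1 is simply the fraction of sampled proofs that check, and the score can only grow as k increases, which is why a benchmark result is hard to interpret without knowing the k behind it.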

IMPACT This research highlights LLM capabilities in formal mathematics, potentially aiding theorem proving and mathematical research.

RANK_REASON The cluster contains an academic paper evaluating LLM performance on a specific task.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.AI TIER_1 English(EN) · Tyson Klingner, Drew Bladek, Escher Crawford, Bohao Chen, Ariel Fu, Kaira Nair, Jarod Alper, Giovanni Inchiostro, Vasily Ilin

    Evaluation of LLMs for Mathematical Formalization in Lean

    arXiv:2606.05632v1 Announce Type: new Abstract: Within the past few years, the ability of Large Language Models (LLMs) to generate formal mathematical proofs has improved drastically. We provide a comparison of various LLMs' effectiveness in producing formal proofs in Lean 4 with…