Brief · PulseAugur

TOOL · arXiv cs.AI

Evaluation of LLMs for Mathematical Formalization in Lean

A new research paper evaluates how well various large language models (LLMs) generate formal mathematical proofs in the Lean 4 theorem prover. The study reports pass@k and refine@k metrics on subsets of the miniF2F and miniCTX datasets. Gemini 3.1 Pro and Claude Opus 4.7 had the highest success rates: Gemini reached 92% on miniF2F and Opus reached 86% on miniCTX. On cost-efficiency, NVIDIA Nemotron 3 Super and GPT-OSS 120B delivered competitive accuracy at a low cost per proof.
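For readers unfamiliar with pass@k, the sketch below shows one common way to compute it: the unbiased estimator popularized by code-generation benchmarks, which estimates the chance that at least one of k sampled proofs is accepted, given n samples of which c passed. This brief does not say which estimator the paper uses, and refine@k is not covered here, so treat the function name `pass_at_k` and the choice of estimator as assumptions for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem (illustrative, not the paper's code).

    n: total proof attempts sampled for the problem
    c: number of those attempts accepted by the proof checker
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing attempts exist, so every size-k subset
    # contains at least one accepted proof.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn attempts are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, each given as an (n, c) pair."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level score is then the mean over problems, for example `mean_pass_at_k([(10, 3), (10, 0)], k=1)` for two problems with 10 attempts each.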

IMPACT This research highlights LLM capabilities in formal mathematics, potentially aiding theorem proving and mathematical research.

GPT-OSS 120B
Claude Opus 4.7
Gemini 3.1 Pro
miniF2F
NVIDIA Nemotron 3 Super
miniCTX