MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis
MA-ProofBench is a new benchmark for evaluating Large Language Models (LLMs) on formal theorem proving in mathematical analysis. It contains 200 formalized theorems across six core topics, split into two difficulty tiers: undergraduate (Level I) and Ph.D. qualifying exam (Level II). Current models, including GPT-5.5, perform poorly: GPT-5.5 achieves only 16% Pass@8 on Level I and 5% on Level II, pointing to significant gaps in formal reasoning. The failure modes identified include Mathlib hallucinations and incomplete proofs, and performance differs notably between informal and formal reasoning.
IMPACT: Highlights the limits of current LLMs in advanced formal reasoning and the need for stronger mathematical theorem-proving capabilities.
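
For readers unfamiliar with the Pass@8 figures quoted above, the sketch below shows one common way a Pass@k score is computed: the standard unbiased estimator 1 - C(n-c, k) / C(n, k), averaged over theorems. This is an assumption for illustration only; the summary does not say how MA-ProofBench computes its scores, and the function names and the sample counts in the usage lines are hypothetical, not benchmark data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled proofs is correct.

    n: proof attempts generated for one theorem
    c: attempts that the proof checker accepted
    k: attempt budget being scored (8 for Pass@8)
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failures means every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets made up only of failed attempts.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-theorem estimate over (n, c) pairs, one pair per theorem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical toy run: 4 theorems, 8 attempts each, made-up success counts.
toy_results = [(8, 0), (8, 1), (8, 0), (8, 3)]
toy_score = benchmark_pass_at_k(toy_results, k=8)
```

When exactly k attempts are drawn per theorem (n = k = 8), the estimator reduces to a simple indicator: a theorem counts as solved if any one of its eight attempts is accepted. Under that reading, a benchmark-level Pass@8 is the fraction of theorems with at least one accepted proof.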