PulseAugur

New Benchmark Reveals LLMs Struggle with Mathematical Analysis Theorem Proving

MA-ProofBench is a new benchmark for evaluating Large Language Models (LLMs) on formal theorem proving in mathematical analysis. It contains 200 formalized theorems across six core topics, split into two difficulty levels: undergraduate (Level I) and Ph.D. qualifying (Level II). Current models perform poorly: GPT-5.5 achieves only 16% Pass@8 on Level I and 5% on Level II, which points to significant gaps in formal reasoning. Identified failure modes include Mathlib hallucinations and incomplete proofs, and model performance differs notably between informal and formal reasoning.
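The summary reports results as Pass@8 but does not say how the metric was computed. As a minimal sketch, assuming the widely used unbiased pass@k estimator (n sampled proof attempts per theorem, c of them accepted by the proof checker), the score could be calculated as below. The function names and the sample counts are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts is correct.

    n: attempts sampled for one theorem
    c: attempts among those n that were verified as correct
    k: attempt budget being scored (k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up only of failed attempts;
    # it is 0 whenever fewer than k attempts failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-theorem pass@k over a whole benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical data: five theorems, eight attempts each.
# Each entry is the number of verified proofs found for that theorem.
counts = [0, 0, 0, 1, 8]
score = benchmark_pass_at_k(counts, n=8, k=8)
print(f"Pass@8 = {score:.0%}")
```

When n equals k, as in this example, the estimator reduces to the fraction of theorems with at least one verified proof among the eight attempts.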

IMPACT Highlights the limits of current LLMs in advanced formal reasoning and the need for stronger mathematical theorem-proving capabilities.

RANK_REASON The cluster describes a new academic paper introducing a benchmark for evaluating LLMs on a specific research task.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.AI TIER_1 English(EN) · Lushi Pu, Weiming Zhang, Xinheng Xie, Zixuan Fu, Bingxiang He, Hongya Lyu, Xin Li, Jie Zhou, Yudong Wang

    MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

    arXiv:2606.13782v1. Abstract: Large Language Models (LLMs) have made notable progress in automated theorem proving, yet existing formal benchmarks remain limited in both mathematical coverage and difficulty. Most are concentrated in areas that are easier to form…