MA-ProofBench is a new benchmark for evaluating Large Language Models (LLMs) on formal theorem proving in mathematical analysis. It contains 200 formalized theorems across six core topics, split into undergraduate (Level I) and Ph.D. qualifying (Level II) difficulty levels. Current models, including GPT-5.5, perform poorly: GPT-5.5 achieves only 16% Pass@8 on Level I and 5% on Level II, pointing to significant gaps in formal reasoning. Identified failure modes include Mathlib hallucinations and incomplete proofs, and performance differs notably between informal and formal reasoning.
AI IMPACT Highlights the limitations of current LLMs in advanced formal reasoning and the need for stronger mathematical theorem-proving capabilities.
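For readers unfamiliar with the Pass@8 figures quoted above, the following is a minimal sketch of how a pass@k score is commonly computed. The summary does not say how MA-ProofBench computes it, so the combinatorial estimator used here is an assumption, and the function names and per-theorem attempt counts are hypothetical illustrations, not data from the benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one theorem from n sampled proofs, c of which verify.

    Assumption: the common combinatorial estimator 1 - C(n-c, k) / C(n, k);
    the benchmark's own scoring procedure is not given in the summary.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a success.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n_samples, n_correct) pairs, one pair per theorem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for four theorems, each sampled 8 times.
example_results = [(8, 0), (8, 2), (8, 0), (8, 0)]
example_score = benchmark_pass_at_k(example_results, k=8)
```

When exactly 8 proofs are sampled per theorem, pass@8 reduces to the fraction of theorems with at least one verified proof among the 8 attempts, so a 16% score would mean roughly one theorem in six is solved at least once.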
RANK_REASON The cluster describes a new academic paper introducing a benchmark for evaluating LLMs on a specific research task.
AI-generated summary · Google Gemini · from 1 source.