PulseAugur / Brief

last 24h
[1/1] 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

    MA-ProofBench is a new benchmark for evaluating Large Language Models (LLMs) on theorem proving in mathematical analysis. It contains 200 formalized theorems across six core topics, split into undergraduate (Level I) and Ph.D. qualifying (Level II) difficulty levels. Current models perform poorly: GPT-5.5 achieves only 16% Pass@8 on Level I and 5% on Level II, which points to significant gaps in formal reasoning. Identified failure modes include Mathlib hallucinations and incomplete proofs, and performance differs notably between informal and formal reasoning.
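
    The Pass@8 figures above count a theorem as solved if at least one of 8 sampled proof attempts succeeds. The brief does not say how MA-ProofBench computes the metric, so the following is only a minimal sketch of the commonly used unbiased pass@k estimator; the function names and the sample counts in the usage note are hypothetical and are not taken from the benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled attempts, c of them correct.

    Uses the standard estimator 1 - C(n - c, k) / C(n, k): the probability that
    a random subset of k attempts (out of the n drawn) contains a correct one.
    Assumption: this matches how the benchmark scores; the source does not say.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n attempts, c correct)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

    With exactly 8 attempts per theorem, pass@8 reduces to the fraction of theorems with at least one verified proof. For example, with made-up data, `benchmark_pass_at_k([(8, 0), (8, 1), (8, 0), (8, 0)], k=8)` evaluates to 0.25, since one of the four theorems was proved.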

    IMPACT: Highlights the limitations of current LLMs in advanced formal reasoning and the need for stronger capabilities in mathematical theorem proving.