MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

New benchmark reveals that large language models struggle with theorem proving in mathematical analysis

By the PulseAugur editorial team · [1 source] · 2026-06-15 04:00

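A new benchmark, MA-ProofBench, has been introduced to evaluate how well large language models (LLMs) prove theorems in mathematical analysis. It contains 200 formalized theorems covering six core topics, split into two difficulty tiers: undergraduate level (Tier 1) and PhD qualifying exam level (Tier 2). Current models, including GPT-5.5, perform poorly: GPT-5.5 reaches a Pass@8 of only 16% on Tier 1 and 5% on Tier 2, which points to a significant gap in formal reasoning ability. The failure modes identified include Mathlib hallucinations and incomplete proofs, and there is a marked gap between informal and formal reasoning performance.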
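For readers unfamiliar with the metric, the sketch below shows one common way a Pass@k score is computed: the unbiased estimator popularized by code-generation benchmarks. This is an illustration only. The summary does not say how MA-ProofBench computes Pass@8, and the function names and sample counts here are hypothetical. When exactly 8 proofs are sampled per theorem (n = k = 8), the estimator reduces to "at least one of the 8 attempts was accepted by the proof checker".

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of sampled proof attempts
    c: number of attempts that were verified as correct
    k: attempt budget being scored (k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k attempts failed, every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # 1 - P(all k attempts drawn without replacement are failures).
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each item is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data (not from the paper): 4 theorems, 8 attempts each,
# only the second theorem has any verified proof.
example = [(8, 0), (8, 3), (8, 0), (8, 0)]
score = benchmark_pass_at_k(example, k=8)
```

Under this reading, a Pass@8 of 16% on a tier would mean that for roughly one theorem in six, at least one of eight sampled proofs was verified.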

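Impact: The results highlight the limitations of current LLMs in advanced formal reasoning and indicate a need to improve their mathematical theorem-proving capabilities.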

Ranking rationale: This cluster describes an academic paper that introduces a new benchmark for evaluating LLM performance on a specific research task.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI · Lushi Pu, Weiming Zhang, Xinheng Xie, Zixuan Fu, Bingxiang He, Hongya Lyu, Xin Li, Jie Zhou, Yudong Wang · 2026-06-15 04:00

MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

arXiv:2606.13782v1. Abstract: Large Language Models (LLMs) have made notable progress in automated theorem proving, yet existing formal benchmarks remain limited in both mathematical coverage and difficulty. Most are concentrated in areas that are easier to form…
