PulseAugur

New Benchmarks Evaluate LLMs' Mathematical Reasoning and Proof Verification

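Researchers have introduced new benchmarks and evaluation methods for assessing the mathematical reasoning of large language models. ComBench focuses on Olympiad-level combinatorics, distinguishes proof reasoning from constructive realization, and finds that even top models struggle with these complex tasks. Another approach, TheoremBench, uses the Lean 4 language to evaluate LLMs' theorem-proving ability in formal mathematics, stressing the need to move beyond competition-style problems and assess how models perform on longer, more dependency-rich mathematical developments. In addition, a strict step-level verification method for research-level proofs aims to address LLM unreliability by carefully checking each reasoning step.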

Impact: These benchmarks and verification methods will drive progress in LLMs' mathematical reasoning and formal proof capabilities.

Ranking rationale: Multiple research papers introduce new benchmarks and evaluation methods for LLM mathematical reasoning.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 7 sources.

Sources [7]

  1. arXiv cs.AI TIER_1 English(EN) · Shunkai Zhang, Haoran Zhang, Yun Luo, Qianjia Cheng, Haodi Lei, Yizhuo Li, Runzhe Zhan, Zhilin Wang, Bangjie Xu, Yucheng Su, Xinmiao Han, Xiaoye Qu, Dongrui Liu, Zhouchen Lin, Yu Qiao, Ning Ding, Yafu Li, Yu Cheng ·

    ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

    arXiv:2606.10479v1 Announce Type: new Abstract: Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier model…

  2. arXiv cs.AI TIER_1 English(EN) · Yifeng Sun ·

    Evaluating Research-Level Math Proofs via Strict Step-Level Verification

    arXiv:2606.10799v1 Announce Type: new Abstract: Large Language Models (LLMs) struggle to rigorously verify complex mathematical proofs. Standard global evaluation approaches suffer from "context poisoning," in which superficially plausible statements mask subtle logical flaws, le…

  3. arXiv cs.AI TIER_1 English(EN) · Yifeng Sun ·

    Evaluating Research-Level Math Proofs via Strict Step-Level Verification

    Large Language Models (LLMs) struggle to rigorously verify complex mathematical proofs. Standard global evaluation approaches suffer from "context poisoning," in which superficially plausible statements mask subtle logical flaws, leading to hallucination or over-skepticism. To ad…

  4. arXiv cs.AI TIER_1 English(EN) · George Tsoukalas, Anton Kovsharov, Sergey Shirobokov, Anja Surina, Moritz Firsching, Gergely Bérczi, Francisco J. R. Ruiz, Arun Suggala, Adam Zsolt Wagner, Eric Wieser, Lei Yu, Aja Huang, Miklós Z. Horváth, Andrew Ferraiuolo, Henryk Michalewski, Ed… ·

    Advancing Mathematics Research with AI-Driven Automated Theorem-Proving Search

    arXiv:2605.22763v2 Announce Type: replace Abstract: Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability limits their utility in mathematics research. A mitigation is using LLMs to generate formal proofs in languages like Lean. We per…

  5. arXiv cs.AI TIER_1 English(EN) · QuocViet Pham, Elvir Karimov, Andrey Galichin, Ivan Oseledets ·

    TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

    arXiv:2606.09450v1 Announce Type: new Abstract: LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-…

  6. Hugging Face Daily Papers TIER_1 English(EN) ·

    TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

    LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce The…

  7. arXiv cs.AI TIER_1 English(EN) · Ivan Oseledets ·

    TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

    LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce The…