TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics
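New benchmarks evaluate LLMs' mathematical reasoning and proof verification abilities
By the PulseAugur editorial team · [7 sources]
Researchers have introduced new benchmarks and evaluation methods for assessing the mathematical reasoning abilities of large language models. ComBench focuses on Olympiad-level combinatorics, distinguishes proof reasoning from constructive implementation, and finds that even top models struggle with these complex tasks. A second effort, TheoremBench, uses the Lean 4 language to evaluate LLMs on theorem proving in formal mathematics, stressing the need to go beyond competition-style problems and measure how models perform on longer, more dependency-rich mathematical developments. In addition, a rigorous step-level verification method for research-level proofs aims to address LLM unreliability by carefully checking each reasoning step.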
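For readers unfamiliar with Lean 4, the following is a minimal illustrative sketch of what a formal theorem and a dependent follow-up result look like. It is not taken from TheoremBench or any of the cited papers; the names `my_add_zero` and `my_double` are hypothetical and chosen only for this example. The second theorem relies on the first, a toy version of the kind of dependency a longer mathematical development builds up.

```lean
-- Hypothetical helper lemma: adding zero on the right leaves a natural number unchanged.
-- This holds by definitional unfolding of addition on Nat, so `rfl` suffices.
theorem my_add_zero (n : Nat) : n + 0 = n := rfl

-- Hypothetical follow-up result that depends on the lemma above.
-- Rewriting with `my_add_zero` turns the goal into `n + n = n + n`,
-- which `rw` then closes by reflexivity.
theorem my_double (n : Nat) : (n + 0) + (n + 0) = n + n := by
  rw [my_add_zero]
```

In a proof assistant, a model's output counts only if the checker accepts every step, which is why formal proving is presented as a mitigation for LLM unreliability.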
AI
arXiv:2606.10479v1 Announce Type: new Abstract: Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier model…
arXiv:2606.10799v1 Announce Type: new Abstract: Large Language Models (LLMs) struggle to rigorously verify complex mathematical proofs. Standard global evaluation approaches suffer from "context poisoning," in which superficially plausible statements mask subtle logical flaws, leading to hallucination or over-skepticism. To ad…
arXiv cs.AI
George Tsoukalas, Anton Kovsharov, Sergey Shirobokov, Anja Surina, Moritz Firsching, Gergely Bérczi, Francisco J. R. Ruiz, Arun Suggala, Adam Zsolt Wagner, Eric Wieser, Lei Yu, Aja Huang, Miklós Z. Horváth, Andrew Ferraiuolo, Henryk Michalewski, Ed…
arXiv:2605.22763v2 Announce Type: replace Abstract: Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability limits their utility in mathematics research. A mitigation is using LLMs to generate formal proofs in languages like Lean. We per…
arXiv cs.AI
QuocViet Pham, Elvir Karimov, Andrey Galichin, Ivan Oseledets
arXiv:2606.09450v1 Announce Type: new Abstract: LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce The…