Brief · PulseAugur

RESEARCH · arXiv cs.AI · 2d · [7 sources]

TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

Researchers have introduced new benchmarks and evaluation methods for assessing the mathematical reasoning of large language models. ComBench targets Olympiad-level combinatorics and separates proof reasoning from constructive realization; its results indicate that even top models struggle with these tasks. TheoremBench evaluates LLMs on theorem proving in formal mathematics using Lean 4, and highlights the need for benchmarks that go beyond competition-style problems to cover longer, dependency-rich mathematical developments. A third line of work proposes strict step-level verification of research-level proofs, checking each deduction step individually to address LLM unreliability.
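For readers unfamiliar with formal theorem proving, the sketch below shows roughly what a machine-checked statement looks like in Lean 4. It is an illustration only: the theorem names `add_zero_example` and `swap_example` are made up for this brief and are not drawn from TheoremBench or any of the cited papers. The second theorem depends on the standard-library lemma `Nat.add_comm`, a very small instance of the kind of dependency that longer formal developments accumulate.

```lean
-- Illustrative only; not a TheoremBench problem.
-- Lean accepts this proof because `n + 0` reduces to `n` by definition.
theorem add_zero_example (n : Nat) : n + 0 = n := rfl

-- This proof depends on an existing library lemma, `Nat.add_comm`.
-- Lean's kernel checks the step; a wrong or missing lemma is rejected.
theorem swap_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

Each proof is either accepted or rejected by Lean's checker, which is what makes formal benchmarks attractive for evaluating model output that would otherwise need human review.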

IMPACT These benchmarks and verification methods give researchers sharper tools for measuring LLM mathematical reasoning and formal proof capabilities, and could help drive progress in both.