PulseAugur

New benchmark MINIF2F-DAFNY tests LLMs for mathematical theorem proving

Researchers have introduced MINIF2F-DAFNY, a benchmark for evaluating Large Language Models (LLMs) on mathematical theorem proving. It translates the miniF2F benchmark into Dafny, an auto-active verifier, so that LLMs supply the high-level proof guidance while Dafny's automated theorem prover handles the low-level details. In evaluations, the best-performing LLM, Claude Opus-4.6, achieved a 62.7% cumulative pass rate, significantly improving upon the baseline performance.

IMPACT This benchmark could accelerate the development of LLMs capable of complex mathematical reasoning and formal verification.

RANK_REASON The cluster describes a new benchmark and evaluation for LLMs in mathematical theorem proving, published on arXiv.

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.LG TIER_1 English(EN) · Mantas Baksys, Stefan Zetzsche, Olivier Bouissou, Sean B. Holden

    MINIF2F-DAFNY: LLM-Guided Mathematical Theorem Proving via Auto-Active Verification

    arXiv:2512.10187v3. Abstract: LLMs excel at reasoning, but validating their steps remains challenging. Formal verification offers a solution through mechanically checkable proofs. Interactive theorem provers (ITPs) dominate mathematical reasoning but require…