Researchers have developed MINIF2F-DAFNY, a new benchmark for evaluating large language models (LLMs) in mathematical theorem proving. The benchmark translates miniF2F to Dafny, an auto-active verifier, so that the LLM supplies high-level proof guidance while Dafny's automated theorem prover handles the low-level details. In evaluations, the best-performing LLM, Claude Opus-4.6, achieved a 62.7% cumulative pass rate, significantly improving upon the baseline performance.
AI IMPACT This benchmark could accelerate the development of LLMs capable of complex mathematical reasoning and formal verification.
RANK_REASON The cluster describes a new benchmark and evaluation for LLMs in mathematical theorem proving, published on arXiv.
- arXiv
- Claude Opus-4.6
- Dafny
- Instituto Todos Pela Saúde
- Mantas Baksys
- miniF2F
- MINIF2F-DAFNY
- satisfiability modulo theories
AI-generated summary · Google Gemini · from 1 source.