
Flaws Found in Lean Theorem-Proving Benchmarks and in RL Model Inference

By the PulseAugur editorial team · [3 sources] · 2026-06-30 04:00

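Researchers have found significant flaws in formal benchmarks built on Lean theorem-proving datasets, identifying thousands of issues, including statements that admit counterexamples and vacuous theorems. A separate study of RL-trained Lean theorem provers shows that these models suffer from mode collapse at inference time: increasing the sampling budget does not yield any additional solved theorems. Interventions such as structured strategy skeletons can nevertheless improve performance, which suggests that inference-time diversity is a key and independent dimension for strengthening RL-trained provers.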
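The two defect types named above can be illustrated with a minimal Lean 4 sketch. Both statements are hypothetical and are not taken from any of the benchmarks discussed; the name `vacuous_example` is invented for illustration, and the `omega` tactic is assumed to be available, as it is in recent Lean 4 toolchains.

```lean
-- Hypothetical illustration of a vacuous theorem (not from any benchmark).
-- The hypotheses h₁ and h₂ contradict each other, so the kernel accepts
-- a proof of any conclusion, however meaningless.
theorem vacuous_example (x : Nat) (h₁ : x < 3) (h₂ : x > 5) : x = 42 := by
  omega

-- Hypothetical illustration of a statement that admits a counterexample.
-- "Every natural number is less than 10" is false, so its negation is provable
-- by instantiating the claim at n = 10.
example : ¬ (∀ n : Nat, n < 10) := by
  intro h
  exact absurd (h 10) (by decide)
```

In both cases the proof checks, yet a solved instance says little about a prover's reasoning ability: the first theorem is true only because its hypotheses can never hold, and the second shows that a benchmark statement can simply be false as written.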

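Impact: Highlights critical problems in how formal-reasoning AI is evaluated, with consequences for benchmark reliability and for the development of theorem-proving agents.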

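Ranking rationale: Two arXiv papers, one detailing problems in formal benchmarks for Lean theorem proving and one presenting a diagnostic study of RL-trained provers.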


AI-generated summary · Google Gemini · based on 3 sources.

Sources [3]

arXiv cs.LG TIER_1 English(EN) · Leni Aniva, Iori Oikawa, David Dill, Clark Barrett · 2026-07-01 04:00

Nazrin: An Atomic Neural Proof Automation Tactic in Lean 4

arXiv:2602.18767v3 Announce Type: replace-cross Abstract: In Machine-Assisted Theorem Proving, a theorem proving agent searches for a sequence of expressions and tactics that can prove a statement in a proof assistant. In this work, we introduce several novel concepts and capabil…

arXiv cs.AI TIER_1 English(EN) · Pawan Sasanka Ammanamanchi, Siddharth Bhat, Stella Biderman · 2026-06-30 04:00

Flaws in Our Formal Benchmarks: Dataset Defects and Evaluation Failures in Lean Theorem Proving

arXiv:2606.29493v1 Announce Type: new Abstract: Benchmarks for LLM-assisted theorem proving in Lean are often treated as intrinsically reliable because every solved instance comes with a machine-checked proof. However, the kernel only checks that a proof establishes a \emph{forma…

arXiv cs.AI TIER_1 English(EN) · Zachary Burton · 2026-06-30 04:00

Inference-Time Diversity in RL-Trained Lean Theorem Provers: A Diagnostic Study

arXiv:2601.16172v3 Announce Type: replace Abstract: RL-trained Lean theorem provers mode-collapse at inference time: on miniF2F-test with DeepSeek-Prover-V1.5-RL, doubling the i.i.d.\ sampling budget from $k{=}32$ to $k{=}64$ produces zero additional solved theorems (42/244 in bo…
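The saturation described in the abstract above (no gain from doubling the i.i.d. sampling budget) can be sketched with a toy model. This is a simplified illustration under stated assumptions, not the paper's method: each theorem is assumed to have a fixed per-sample success probability, samples are assumed independent, and the probability values below are hypothetical. Only the counts 42 and 244 come from the abstract; the function name `expected_solved` is invented here.

```python
def expected_solved(success_probs, k):
    """Expected number of theorems solved by at least one of k i.i.d. samples.

    Assumes each theorem has a fixed per-sample success probability p,
    so the chance of at least one success in k tries is 1 - (1 - p)**k.
    """
    return sum(1.0 - (1.0 - p) ** k for p in success_probs)


# Hypothetical per-theorem probabilities (NOT measured values).
# "Collapsed" prover: near-certain on 42 theorems, zero mass on the other 202.
collapsed = [0.9] * 42 + [0.0] * 202
# "Diverse" prover: same easy set, plus a small chance on every other theorem.
diverse = [0.9] * 42 + [0.01] * 202

for name, probs in [("collapsed", collapsed), ("diverse", diverse)]:
    gain = expected_solved(probs, 64) - expected_solved(probs, 32)
    print(f"{name}: expected gain from k=32 to k=64 = {gain:.2f}")
```

Under these assumptions, a prover whose probability mass sits entirely on theorems it already solves gains nothing from a larger budget, while even a small nonzero probability on the remaining theorems makes extra samples pay off. That is one way to read the claim that inference-time diversity is a separate axis from per-sample accuracy.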

