PulseAugur
English(EN) Inference-Time Diversity in RL-Trained Lean Theorem Provers: A Diagnostic Study

Flaws Found in Lean Theorem-Proving Benchmarks and in RL Model Inference

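Researchers found major flaws in formal benchmark datasets for Lean theorem proving, uncovering thousands of problematic entries, including statements that admit counterexamples and vacuous theorems. A separate study of RL-trained Lean theorem provers shows that these models mode-collapse at inference time: increasing the sampling budget does not yield more solved theorems. Interventions such as structured strategy skeletons can nonetheless improve performance, which suggests that inference-time diversity is a critical and independent dimension for strengthening RL-trained provers.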
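To make "vacuous theorem" concrete, here is a minimal Lean 4 sketch. It is an illustrative example only, not an entry from any of the benchmarks discussed; the theorem name and statement are hypothetical. The hypothesis can never hold for a natural number, so the kernel accepts a proof of an arbitrary conclusion, and a "solved" instance of this kind says nothing about the intended mathematics.

```lean
-- Hypothetical example: `n < 0` is impossible for a natural number,
-- so the conclusion `n = 42` follows vacuously.
theorem vacuous_example (n : Nat) (h : n < 0) : n = 42 :=
  absurd h (Nat.not_lt_zero n)
```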

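Impact: Highlights critical problems in how formal-reasoning AI is evaluated, which affect the reliability of benchmarks and the development of theorem-proving agents.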

Ranking rationale: Two arXiv papers detailing problems in formal benchmarks for Lean theorem proving and a diagnostic study of RL-trained provers.


AI-generated summary · Google Gemini · from 3 sources.


Sources [3]

  1. arXiv cs.LG TIER_1 English(EN) · Leni Aniva, Iori Oikawa, David Dill, Clark Barrett

    Nazrin: An Atomic Neural Proof Automation Tactic in Lean 4

    arXiv:2602.18767v3 Announce Type: replace-cross Abstract: In Machine-Assisted Theorem Proving, a theorem proving agent searches for a sequence of expressions and tactics that can prove a statement in a proof assistant. In this work, we introduce several novel concepts and capabil…

  2. arXiv cs.AI TIER_1 English(EN) · Pawan Sasanka Ammanamanchi, Siddharth Bhat, Stella Biderman

    Flaws in Our Formal Benchmarks: Dataset Defects and Evaluation Failures in Lean Theorem Proving

    arXiv:2606.29493v1 Announce Type: new Abstract: Benchmarks for LLM-assisted theorem proving in Lean are often treated as intrinsically reliable because every solved instance comes with a machine-checked proof. However, the kernel only checks that a proof establishes a \emph{forma…

  3. arXiv cs.AI TIER_1 English(EN) · Zachary Burton

    Inference-Time Diversity in RL-Trained Lean Theorem Provers: A Diagnostic Study

    arXiv:2601.16172v3 Announce Type: replace Abstract: RL-trained Lean theorem provers mode-collapse at inference time: on miniF2F-test with DeepSeek-Prover-V1.5-RL, doubling the i.i.d.\ sampling budget from $k{=}32$ to $k{=}64$ produces zero additional solved theorems (42/244 in bo…
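
The budget comparison described in the third abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's evaluation code: the helper `solved_at_k`, the theorem names, and the toy outcomes are all hypothetical. It counts which theorems have at least one successful sample among the first k attempts and checks whether a larger budget adds any.

```python
from typing import Dict, List, Set


def solved_at_k(outcomes: Dict[str, List[bool]], k: int) -> Set[str]:
    """Return the theorems with at least one success among the first k samples."""
    return {name for name, samples in outcomes.items() if any(samples[:k])}


# Hypothetical toy data: one list of per-sample verification results per theorem.
toy_outcomes = {
    "thm_a": [True, True, True, True],      # solved by every sample
    "thm_b": [False, True, False, True],    # solved within the first two samples
    "thm_c": [False, False, False, False],  # never solved
}

small_budget = solved_at_k(toy_outcomes, 2)
large_budget = solved_at_k(toy_outcomes, 4)

# Theorems gained by doubling the budget; an empty set is the
# "zero additional solved theorems" pattern.
gained = large_budget - small_budget
print(len(small_budget), len(large_budget), sorted(gained))
```

When samples keep landing on the same theorems, `gained` stays empty however large k becomes, which is why the set of solved theorems is a more telling signal here than the raw sample count.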