PulseAugur / Brief

last 24h
[37/37] 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. Evaluating the Robustness of Proof Autoformalization in Lean 4

    A new study on arXiv evaluates the robustness of proof autoformalization models, which translate natural language mathematical proofs into formal languages like Lean 4. Researchers introduced global and local perturbations to informal proofs to test model consistency and faithfulness. The evaluation found that seven recent models were sensitive to global paraphrasing and largely failed to accurately reflect local changes in symbols or proof steps. AI

  2. The Faithfulness Gap: Certifying Semantic Equivalence Between Natural-Language and Formal Mathematical Statements

    Researchers have introduced Bidirectional Provability Fingerprinting (BPF), a new framework designed to certify the faithfulness of autoformalized mathematical statements. This method addresses the challenge where translated formal statements may be provable but not semantically equivalent to the original natural-language intent. The framework includes components for generating counterfactual probes, an equivalence spectrum for continuous scoring, adaptive budget allocation, and faithfulness-guided decoding. A new benchmark, DriftBench, comprising 2,183 NL/Lean 4 pairs, was also released to evaluate these methods. AI

    IMPACT This research aims to improve the reliability of AI systems translating natural language mathematics into formal proofs, potentially increasing trust in AI-assisted mathematical discovery.

  3. AI4SLT: Empirical Processes in Lean 4 for Formal Statistical Learning Theory

    Researchers have developed a formalization of statistical learning theory using Lean 4, a proof assistant, to establish a rigorous foundation for machine learning theory. This project involved a human-AI collaboration where AI agents assisted in constructing proofs for concepts like Gaussian Lipschitz concentration and Dudley's entropy integral theorem. The formalization process has also helped identify and resolve ambiguities in existing statistical learning theory textbooks, creating a reusable toolbox for future research. AI

    IMPACT Establishes a formal, verifiable foundation for machine learning theory, potentially improving rigor and enabling new theoretical advancements.

  4. Identifiability Without Gaussianity: Symbolic World Models and Near-Infinite Temporal Consistency

    A new research paper introduces the Physics-Grounded Symbolic Architecture (PGSA), which overcomes limitations in current statistical World Models. Unlike existing models that require Gaussian dynamics for linear identifiability and temporal consistency, PGSA can achieve exact linear identifiability across all physical regimes. This new architecture also offers near-infinite temporal consistency, meaning its error is bounded only by numerical precision, even for non-Gaussian systems. AI

    IMPACT Introduces a novel architecture that could enable more robust and long-term predictive capabilities in AI systems.

  5. Signed Compression Progress on a Sealed Audit is Goodhart-Resistant

    A new research paper proposes a method called "signed compression progress" as a more robust form of intrinsic motivation for AI agents. This approach aims to ensure that an agent's reward is directly tied to genuine learning and improvement, rather than exploitable metrics. The paper provides a formal proof and experimental evidence demonstrating that this method resists common failure modes like reward clipping and exploitation of easily predictable outcomes. AI

    IMPACT Introduces a theoretically sound method to prevent AI agents from gaming their reward systems, potentially leading to more reliable AI development.

  6. Evaluation of LLMs for Mathematical Formalization in Lean

    A new research paper evaluates the performance of various Large Language Models (LLMs) in generating formal mathematical proofs using the Lean 4 theorem prover. The study employed pass@k and refine@k metrics on subsets of the miniF2F and miniCTX datasets. Gemini 3.1 Pro and Claude Opus 4.7 demonstrated the highest success rates, with Gemini achieving 92% on miniF2F and Opus reaching 86% on miniCTX. For cost-efficiency, NVIDIA Nemotron 3 Super and GPT-OSS 120B offered competitive accuracies at a low cost per proof. AI

    IMPACT This research highlights LLM capabilities in formal mathematics, potentially aiding theorem proving and mathematical research.
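
    For readers unfamiliar with the pass@k metric mentioned above, the sketch below computes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), from n sampled proofs of which c pass the checker. Whether the paper uses this exact estimator is an assumption, the function name pass_at_k is hypothetical, and refine@k is not covered here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled proofs, c of which were verified.

    Illustrative only: this is the standard unbiased estimator,
    assumed here rather than taken from the paper.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no verified proof.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail
```

    With 10 samples and 3 verified proofs, pass_at_k(10, 3, 1) is approximately 0.3, which matches the plain success fraction.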

  7. Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts

    Researchers have developed new frameworks to enhance formal theorem proving capabilities using large language models. Goedel-Architect utilizes a blueprint generation and refinement strategy, achieving state-of-the-art performance on benchmarks like MiniF2F-test and PutnamBench with the DeepSeek-V4-Flash model. Proof-Refactor focuses on improving the modularity, readability, and maintainability of LLM-generated proofs, outperforming existing baselines on the PutnamBench dataset. Another approach, Compile to Compress, leverages compiler outputs to refine proof attempts efficiently, achieving top results on PutnamBench with smaller models. AI

    IMPACT These advancements in AI-driven formal theorem proving could accelerate mathematical discovery and software verification.

  8. Show HN: Formally verified polygon intersection – Opus 4.8 oneshots, prev failed

    A developer has created a formally verified implementation for polygon intersection, a standard feature in vector graphics editors. This project utilized AI agents, with recent models capable of generating algorithm implementations and formal proofs in a single step, a significant improvement over previous multi-step processes. The correctness of the algorithm is guaranteed by the Lean proof assistant and human review of a concise specification, not solely by the AI model. AI

    IMPACT Demonstrates AI's growing capability in assisting with formal verification tasks, potentially accelerating the development of reliable software.

  9. Lean-GAP: A Dataset of Formalized Graduate Algebra Problems

    Researchers have developed Lean-GAP, a dataset containing 430 formalized graduate-level algebra problems derived from the textbook "Abstract Algebra" by Dummit and Foote. The process involved a pipeline for PDF-to-LaTeX preprocessing and autoformalization into Lean 4, though human oversight was crucial for verification. This work contributes a structured dataset, a methodology for formalizing mathematical texts, and an analysis of challenges in translating informal statements to formal language, including comparisons of autoformalization models. AI

    IMPACT Formalizing complex mathematical texts could enable more robust AI reasoning and verification in advanced academic domains.

  10. FVSpec: Real-World Property-Based Tests as Lean Challenges

    Researchers have developed a new benchmark called FVSpec to evaluate AI models on formal software verification tasks. The benchmark was created by translating over 2,700 real-world Python property-based tests into more than 9,400 specifications in the Lean 4 proof assistant language. This process involved modeling Python semantics and inferring logical properties, presenting significant challenges due to the complexity of dependent-type programming. The project aims to advance AI-assisted formal verification, a field gaining importance as AI contributes more to software development. AI

    IMPACT This benchmark could drive progress in AI-assisted formal verification, a critical area for ensuring the reliability of AI-generated code.

  11. Automated Conjecture Resolution with Formal Verification

    Researchers have developed a novel framework that merges informal reasoning with formal verification to tackle complex mathematical problems. This system, comprising an informal agent named Rethlas and a formal agent called Archon, utilizes theorem search and automated proof synthesis to ensure machine-checkable correctness. The framework successfully resolved an open problem in commutative algebra and formally verified the proof with minimal human intervention, showcasing a promising path for AI-assisted mathematical discovery and collaboration. AI

    IMPACT Demonstrates a new paradigm for AI to assist in solving and verifying complex mathematical research problems, potentially accelerating discovery.

  12. Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification

    Researchers have introduced Expected Value Alignment (EVA), a new procedure for training reward models used with large language models in formal mathematics verification. EVA addresses a trade-off in existing models by extracting continuous scores from a model's token distribution while preserving discrete textual rationales. This method was implemented in a model called Leibniz for Lean 4 formal verification, showing reduced discretization artifacts compared to baseline approaches. AI

    IMPACT This new method could improve the accuracy and interpretability of AI systems used in formal mathematical reasoning.
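
    The idea of reading a continuous score off a token distribution can be sketched as follows: take the model's logits over a small set of candidate score tokens, apply a softmax, and return the expectation instead of the single most likely score. This is a generic illustration under stated assumptions, not EVA's training procedure, which the summary does not describe; the function name expected_score and the score set are hypothetical.

```python
import math
from typing import Mapping


def expected_score(score_logits: Mapping[int, float]) -> float:
    """Return the softmax-weighted mean of the candidate scores.

    score_logits maps each candidate score (e.g. 1..5) to the logit the
    model assigns to that score's token. Hypothetical interface.
    """
    if not score_logits:
        raise ValueError("need at least one candidate score")
    # Subtract the max logit for numerical stability before exponentiating.
    top = max(score_logits.values())
    weights = {s: math.exp(l - top) for s, l in score_logits.items()}
    total = sum(weights.values())
    return sum(s * w for s, w in weights.items()) / total
```

    An argmax over the same logits would return only one of the discrete scores, whereas the expectation varies smoothly with the logits, which is one way discretization artifacts can be reduced.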

  13. ProofWala: A Framework for Multilingual Proof Data Synthesis and Theorem-Proving

    Researchers have developed ProofWala, a new framework designed to facilitate multilingual proof data synthesis and theorem-proving for neural approaches. This framework includes a reusable library for interacting with interactive theorem provers (ITPs) and supports project-wide analysis and parallel experimentation. By training models multilingually across different ITPs like Lean 4 and Rocq, the system demonstrates improved cross-lingual and cross-domain transfer capabilities, showing statistically significant gains in specific mathematical domains. AI

    IMPACT Enables more robust and scalable research in formal verification and automated theorem-proving by facilitating multilingual data synthesis and cross-lingual transfer.

  14. On Compositional Learning Behaviours in Formal Mathematics

    Researchers have developed AutoformBot, a multi-agent system that uses LLMs and formal verification tools to translate informal mathematical prose into machine-checked code. This system has been applied to 26 mathematics textbooks, resulting in a verified library of over 45,000 declarations and 500,000 lines of code in Lean 4. Separately, a study on compositional learning behaviors in formal mathematics found that while search-heavy models can handle basic tasks, achieving Olympiad-level performance requires strong compositional learning capabilities, which are necessary but not sufficient for complex mathematical verification. AI

    IMPACT Advances in AI's ability to formalize complex mathematical reasoning could accelerate AI-driven scientific discovery and theorem proving.

  15. FVSpec: Real-World Property-Based Tests as Lean Challenges

    Researchers have introduced FVSpec, a new benchmark designed to evaluate AI models and agents in formal software verification tasks. The benchmark involves translating property-based tests from Python into specifications using a multi-agent LLM pipeline. This process aims to address the challenges of modeling Python semantics and inferring logical properties within the Lean 4 programming language, with the goal of advancing AI-assisted formal verification for real-world software. AI

    IMPACT This benchmark aims to drive progress in AI-assisted formal verification, a critical area as AI contributes more to software development.

  16. MerLean-Prover: A Recursive Looping Harness for End-to-End Lean 4 Theorem Proving

    Researchers have developed MerLean-Prover, an end-to-end theorem prover for Lean 4 that generates kernel-checkable proofs. The system utilizes a recursive loop with three agent types (Planning, Check, and Lean) and has demonstrated strong performance on benchmarks like FormalQualBench and Putnam2025. Notably, MerLean-Prover achieved 10/23 on FormalQualBench, outperforming existing open-source baselines, and successfully solved all 12 problems on Putnam2025 with reduced computation time. The harness design also proved effective with smaller models, including Sonnet and Haiku. AI

  17. TorchLean: Formalizing Neural Networks in Lean

    Researchers have developed TorchLean, a framework that formalizes neural networks within the Lean 4 theorem prover. This system allows for the execution and verification of neural networks directly within the same environment where mathematical proofs are conducted. TorchLean supports various neural network components, including attention mechanisms and diffusion models, and offers features for exact and finite-precision tensor semantics, differentiation, and bound propagation. AI

    IMPACT Enables formal verification of neural networks, crucial for safety-critical applications.

  18. Lean Formalization of Generalization Error Bound by Rademacher Complexity and Dudley's Entropy Integral

    Researchers have formalized generalization error bounds using Rademacher complexity in the Lean 4 proof assistant. This work builds upon measure-theoretic probability theory within the Mathlib library. The formalization includes a mechanically-checked pipeline from definitions to high-probability uniform deviation bounds via a proved McDiarmid inequality, with applications to linear predictors and Dudley-type entropy integral bounds. AI

    IMPACT Provides a mechanically-verified foundation for understanding machine learning model generalization, potentially improving trust and reliability in theoretical guarantees.
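
    For orientation, the textbook form of such a bound is given below. It assumes a function class $\mathcal{F}$ with values in $[0,1]$ and an i.i.d. sample $X_1,\dots,X_n$; the exact statement, constants, and hypotheses in the Lean formalization may differ.

```latex
% Textbook uniform deviation bound (assumed form, not quoted from the paper).
% With probability at least 1 - \delta over the sample:
\[
  \sup_{f \in \mathcal{F}}
  \left( \mathbb{E}[f(X)] - \frac{1}{n}\sum_{i=1}^{n} f(X_i) \right)
  \;\le\; 2\,\mathfrak{R}_n(\mathcal{F}) + \sqrt{\frac{\ln(1/\delta)}{2n}},
\]
% where the Rademacher complexity uses i.i.d. uniform signs \sigma_i \in \{-1,+1\}:
\[
  \mathfrak{R}_n(\mathcal{F})
  = \mathbb{E}\left[ \sup_{f \in \mathcal{F}}
    \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(X_i) \right].
\]
```

    The square-root term comes from McDiarmid's inequality, since changing one sample point moves the supremum by at most $1/n$; the factor $2\,\mathfrak{R}_n(\mathcal{F})$ comes from symmetrization.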

  19. Keep the Proof State Live: Snapshotting for Efficient Tactic Search in Lean 4

    Researchers have developed a new method called proof-state snapshotting to significantly speed up automated theorem proving in Lean 4. This technique addresses the inefficiency of repeatedly reconstructing proof states during parallel tactic search, which is a bottleneck in current systems. By capturing and reusing elaborated proof states, the new approach offers substantial wall-time speedups, particularly as the number of search branches increases. AI

    IMPACT This technique could enable more scalable and efficient automated reasoning systems, potentially accelerating AI development in formal verification and mathematical discovery.

  20. ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

    Researchers have developed ImProver 2, a neurosymbolic framework designed to optimize formal mathematical proofs within the Lean 4 environment. This system employs an expert-iteration pipeline and a scaffold that integrates formal structure with informal abstractions to address challenges like heterogeneous objectives and high computational costs. A 7B-parameter model trained with ImProver 2 has demonstrated performance competitive with larger frontier models and significantly improved efficiency across various metrics, suggesting proof optimization is a scalable and learnable task. AI

    IMPACT Demonstrates that smaller AI models, when properly trained and scaffolded, can effectively restructure complex research-level proofs, potentially making formal mathematics more accessible.

  21. Decompose, Structure, and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees

    Researchers have developed a new neuro-symbolic framework called Decompose, Structure, and Repair (DSR) to improve the process of autoformalization, which translates natural language mathematical statements into formal code. Unlike previous methods that treated formal code as flat sequences, DSR breaks down statements into logical components and maps them to structured operator trees. This approach allows for more precise error localization and repair through sub-tree refinement. The framework was evaluated on a new benchmark called PRIME, consisting of 156 theorems, and demonstrated state-of-the-art performance. AI

    IMPACT Introduces a novel neuro-symbolic approach to autoformalization, potentially improving the reliability and efficiency of translating mathematical language into formal code.

  22. A Formally Verified Library of Mathematical Finance in Lean 4

    Researchers have developed a comprehensive library of mathematical finance theorems using the Lean 4 proof assistant. This library, built upon Mathlib and the BrownianMotion package, includes over two hundred theorems covering a wide range of topics from stochastic calculus to portfolio theory. A key feature is its faithfulness audit, which precisely documents the axioms used for each proof, ensuring transparency and verifiability. The project's contribution is primarily methodological, providing reusable, verified foundations for mathematical finance rather than new financial theories. AI

    IMPACT Provides verified foundations for mathematical finance, potentially improving reliability in quantitative finance applications.

  23. The Attribution Impossibility: No Feature Ranking Is Faithful, Stable, and Complete Under Collinearity

    A new research paper published on arXiv demonstrates that no feature ranking method can be simultaneously faithful, stable, and complete when features are collinear. The study proves this impossibility and quantifies it across various model classes, suggesting that ensemble averaging methods like DASH can resolve this issue. The findings have direct implications for fairness auditing, indicating that SHAP-based proxy discrimination audits are unreliable under collinearity. AI

    IMPACT Highlights fundamental limitations in current explainable AI methods, impacting fairness audits and model interpretability.

  24. Group-Algebraic Tensors: Provably-optimal Equivariant Learning and Physical Symmetry Discovery

    Researchers have developed a new tensor algebra framework called $\star_G$ that intrinsically embeds equivariance, allowing for symmetry-preserving tensor approximation and physical symmetry discovery. This framework offers a closed-form decomposition of predictions per irreducible representation and can identify the underlying symmetry group from data alone. Empirical demonstrations on molecular geometry data show significant parameter reduction compared to standard MLPs while achieving comparable predictive power. AI

    IMPACT Introduces a novel algebraic approach to incorporate physical symmetries into machine learning models, potentially enabling more efficient and interpretable AI for scientific discovery.

  25. Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem

    This paper details a case study using the Aristotle API for AI-assisted theorem proving within the Lean 4 formalization environment. The study focused on the Grasshopper problem, a challenge from IMO 2009. While the AI generated verified lemmas for local proof components, it left the main theorem unresolved, highlighting a limitation in AI's ability to handle global combinatorial bookkeeping required for complex mathematical proofs. AI

    IMPACT Demonstrates current limitations of AI in complex mathematical formalization, particularly in global combinatorial reasoning.

  26. Interpretable epistemic uncertainty decomposition in sequential generative models via polynomial chaos surrogates

    Researchers have developed a new method to decompose epistemic uncertainty in sequential generative models, particularly those used in AI-driven scientific discovery. By fitting polynomial chaos expansions to ensembles of trained models, the approach provides an interpretable breakdown of how reward uncertainties influence generative decisions. This technique offers actionable insights into complex datasets, outperforming traditional methods like deep ensembles and Bayesian neural networks in identifying sensitive and robust components across various scientific tasks. AI

    IMPACT Provides a novel framework for understanding and interpreting uncertainty in AI models used for scientific discovery, potentially leading to more robust and reliable AI-driven research.

  27. Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

    Researchers have introduced Formal Conjectures, a new benchmark designed to evaluate automated reasoning systems in mathematics. This evolving dataset, formalized in Lean 4, comprises over 2,600 mathematical problem statements, including 1,029 open research conjectures and 836 solved problems. The benchmark facilitates collaboration between mathematicians and AI systems, and has already contributed to resolving open conjectures, demonstrating its potential for advancing AI-driven mathematical discovery. AI

    IMPACT Advances AI capabilities in formal mathematics and aids in discovering new mathematical proofs.

  28. FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

    Researchers have introduced FormalRewardBench, a new benchmark designed to evaluate reward models used in formal theorem proving. This benchmark addresses the challenge of sparse credit assignment in reinforcement learning for theorem provers by enabling the comparison of reward models without extensive retraining. FormalRewardBench includes 250 preference pairs with various error injection strategies and has been used to test several large language models, revealing that frontier models perform best in evaluating proof quality. AI

    IMPACT This benchmark aims to improve reward models for AI theorem provers, potentially leading to more capable AI systems in formal mathematics and complex reasoning tasks.

  29. CktFormalizer: Autoformalization of Natural Language into Circuit Representations

    Researchers have developed CktFormalizer, a framework that uses Lean 4 to improve the generation of hardware descriptions from natural language by large language models. This system employs dependent types to catch common hardware defects like width mismatches and incomplete logic as compile-time errors, ensuring greater correctness. CktFormalizer not only achieves competitive simulation pass rates but also significantly enhances backend realizability, with optimized designs showing substantial reductions in area and power while maintaining functional equivalence. AI

    IMPACT Enhances the reliability and efficiency of LLM-driven hardware design, potentially accelerating chip development.

  30. Discovering New Theorems via LLMs with In-Context Proof Learning in Lean

    Researchers have developed a new pipeline called the Conjecturing-Proving Loop (CPL) that uses Large Language Models (LLMs) to discover new mathematical theorems and generate formal proofs in Lean 4. CPL iteratively creates conjectures and attempts to prove them, leveraging previously generated theorems and proofs for in-context learning. This approach demonstrates improved discovery rates for complex theorems compared to simultaneous statement and proof generation, highlighting the effectiveness of self-generated context for neural theorem proving. AI

    IMPACT Introduces a novel method for LLMs to discover mathematical theorems, potentially accelerating formal verification and mathematical research.

  31. Automated Formal Proofs of Combinatorial Identities via Wilf-Zeilberger Guidance and LLMs

    Researchers have developed WZ-LLM, a novel neuro-symbolic framework that combines the Wilf-Zeilberger (WZ) method with large language models (LLMs) to automate formal proofs of combinatorial identities. This approach translates WZ proof plans into executable sketches in Lean 4, leveraging an LLM-based prover for subgoals. Experiments demonstrate that WZ-LLM achieves a 34% success rate on the LCI-Test dataset, surpassing existing methods like DeepSeek-V3 and Goedel-Prover-V2. AI

    IMPACT This research could accelerate formal verification in mathematics and computer science by improving automated theorem proving capabilities.
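
    As background on the Wilf-Zeilberger method itself, a classical worked example is shown below: proving $\sum_k \binom{n}{k} = 2^n$ by exhibiting a WZ pair. This is a standard textbook illustration, not an identity or proof plan taken from the paper.

```latex
% Goal: \sum_k F(n,k) = 1 for all n \ge 0, with
\[
  F(n,k) = \frac{\binom{n}{k}}{2^{n}}, \qquad
  G(n,k) = -\frac{\binom{n}{k-1}}{2^{n+1}} .
\]
% WZ equation (check with Pascal's rule \binom{n+1}{k} = \binom{n}{k} + \binom{n}{k-1}):
\[
  F(n+1,k) - F(n,k)
  = \frac{\binom{n}{k-1} - \binom{n}{k}}{2^{n+1}}
  = G(n,k+1) - G(n,k).
\]
% Summing over all integers k, the right-hand side telescopes to 0
% because G(n,k) vanishes for k outside 1..n+1. Hence
\[
  \sum_k F(n+1,k) = \sum_k F(n,k) = \dots = \sum_k F(0,k) = 1 .
\]
```

    The certificate $G$ reduces the identity to a single algebraic check per $(n,k)$, which is the kind of step that lends itself to a proof sketch with mechanically dischargeable subgoals.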

  32. Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game

    Researchers have developed a new method to jailbreak large language models by exploiting their safe completion mechanisms through deceptive multi-turn conversations. This technique, termed intention deception, gradually builds trust by simulating benign intentions, ultimately guiding models like GPT-5 and Claude-Sonnet-4.5 towards generating harmful outputs. The study also identified a new vulnerability called para-jailbreaking, where models reveal harmful information indirectly, and demonstrated the method's effectiveness on multimodal vision-language models. AI

    IMPACT New jailbreaking techniques highlight the ongoing challenges in AI safety and the need for more robust alignment strategies.

  33. "The Network Structure of Mathlib" Post by Simone Severini on LinkedIn: 2/2 ... We read Mathlib as a transitional object, i.e., mathematics in the middle

    A new paper analyzes Mathlib, the largest formalized mathematics library in Lean 4, by treating it as a network. Researchers found that the library's organizational structure, based on folders and naming conventions, does not align with the actual mathematical dependencies between theorems. The study also revealed that a significant portion of logical dependencies cross naming boundaries and that many connections are implicitly generated by the compiler rather than explicitly written by humans. Furthermore, the network analysis indicates that the most frequently used element is the reflexivity of equality, rather than mathematically deeper theorems like the Chinese Remainder Theorem. AI

    IMPACT Provides a framework for analyzing formalized mathematics libraries, potentially informing future AI-generated proof systems.

  34. Surface Sensitivity in Lean 4 Autoformalization

    Researchers have investigated the impact of natural language variations on Lean 4 autoformalization, finding that semantically equivalent paraphrases can lead to different formal outputs. Their study, using GPT-family models and open-weight autoformalizers on ProofNet# and miniF2F datasets, revealed that these sensitivities are primarily due to compilation failures rather than semantic disagreements. The findings suggest that future efforts should focus on improving the compilation process rather than the semantic layer of these systems. AI

    IMPACT Suggests focusing training on compilation rather than semantic layers for autoformalization tools.

  35. OptProver: Bridging Olympiad and Optimization through Continual Training in Formal Theorem Proving

    Researchers have developed OptProver, a novel AI model designed to tackle formal theorem proving in undergraduate optimization problems. This model builds upon existing provers trained on Olympiad-level mathematics, adapting them to the distinct formalisms of optimization. OptProver utilizes large-scale data curation and a specialized preference learning objective to improve its performance and efficiency in generating proofs. AI

    IMPACT Introduces a new benchmark and model for formal theorem proving in optimization, potentially advancing AI's capabilities in mathematical reasoning.

  36. Benchmarking Testing in Automated Theorem Proving

    Researchers have developed a new framework called T to evaluate the semantic correctness of theorems generated by large language models in automated theorem proving. This approach, inspired by code generation testing, verifies theorems by checking if dependent successor theorems compile successfully. Experiments using T on real-world Lean 4 repositories revealed that while current models like Claude-Sonnet-4.5 can compile generated theorems, their semantic accuracy is significantly lower, highlighting a gap in their theorem generation capabilities. AI

    IMPACT Introduces a novel semantic evaluation metric for LLM-generated theorems, revealing significant performance gaps in current models.

  37. Show HN: Formal Verification for Machine Learning Models Using Lean 4

    A new open-source framework called FormalVerifML has been released, utilizing Lean 4 for the formal verification of machine learning models. This tool aims to provide mathematically rigorous proofs of properties like robustness, fairness, and safety for high-stakes applications. It supports large-scale models, including transformers and vision models, with features for enterprise use and distributed verification. AI

    IMPACT Enhances trust and reliability in ML models for critical applications through formal verification.