New benchmark reveals AI struggles with verified code generation

By PulseAugur Editorial · [1 source] · 2026-06-04 04:00

A new benchmark, AlgoVeri, evaluates how well AI models generate formally verified code for classical algorithms. It tests models in three verification languages (Dafny, Verus, and Lean) and reveals significant capability gaps. Gemini-3 Flash shows moderate success in Dafny, but its performance drops considerably in Verus and Lean, highlighting challenges with memory constraints and explicit proof construction.
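To make "formally verified code" concrete, here is a minimal Lean 4 sketch of the kind of artifact such a task asks for: an implementation together with a machine-checked proof that it meets a specification. The function `myMax` and the theorem name are hypothetical illustrations, not items from AlgoVeri, and the benchmark's actual problems are classical algorithms well beyond this toy.

```lean
-- Hypothetical toy example, not taken from the AlgoVeri benchmark.
-- Implementation: return the larger of two natural numbers.
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Specification (one property): the result is at least the first argument.
-- The proof must be constructed explicitly and is checked by Lean.
theorem myMax_ge_left (a b : Nat) : a ≤ myMax a b := by
  unfold myMax
  split
  · assumption          -- case a ≤ b: the goal a ≤ b is the hypothesis itself
  · exact Nat.le_refl a -- case ¬ a ≤ b: the goal reduces to a ≤ a
```

In Lean the proof is written out step by step, as above, which is the "explicit proof construction" the summary refers to.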

IMPACT Highlights limitations in current AI models for generating formally verified code, suggesting areas for future research and development in formal verification tools.

RANK_REASON The cluster contains an academic paper introducing a new benchmark for evaluating AI models on a specific task.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Haoyu Zhao, Ziran Yang, Jiawei Li, Deyuan He, Zenan Li, Chi Jin, Venugopal V. Veeravalli, Aarti Gupta, Sanjeev Arora · 2026-06-04 04:00

AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms

arXiv:2602.09464v2 Announce Type: replace-cross Abstract: Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmar…

