Researchers have introduced ProgramBench, a new benchmark designed to evaluate the holistic software development capabilities of language models. The benchmark challenges AI agents to architect and implement entire codebases from scratch, given only a program's documentation. Across 200 tasks, including reimplementations of software such as FFmpeg and SQLite, none of the nine evaluated language models could fully complete any task, and the best model passed only 3% of tests on average.
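For context on the headline number, here is a minimal sketch of how an evaluation harness might compute "percent of tests passed, averaged per task"; the names and structure below are hypothetical illustrations, not taken from the ProgramBench paper:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical structure)."""
    tests_passed: int
    tests_total: int

def average_pass_rate(results: list[TaskResult]) -> float:
    """Mean fraction of tests passed, averaged per task.

    A task with 0/120 passing tests contributes 0.0; a fully
    solved task contributes 1.0. The reported "3% of tests on
    average" would correspond to a return value of ~0.03.
    """
    per_task = [r.tests_passed / r.tests_total
                for r in results if r.tests_total > 0]
    return sum(per_task) / len(per_task) if per_task else 0.0

# Example: three tasks, none fully solved.
results = [TaskResult(2, 50), TaskResult(0, 120), TaskResult(5, 80)]
print(f"{average_pass_rate(results):.1%}")  # prints 3.4%
```

Under this kind of per-task averaging, a model can score a few percent overall while fully completing zero tasks, which matches the result described above.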
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT Highlights the current limitations of LLMs on complex, end-to-end software engineering tasks, suggesting that fully autonomous code generation remains an open research problem.
RANK_REASON This is a research paper introducing a new benchmark for evaluating language models in software development.