Researchers have introduced ProgramBench, a new benchmark designed to evaluate the holistic software development capabilities of language models. The benchmark challenges AI agents to architect and implement entire codebases from scratch, given only a program's documentation. Across 200 tasks, including reimplementations of software such as FFmpeg and SQLite, none of the nine evaluated language models could fully complete any task, and the best model passed only 3% of tests on average.
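For context on the headline number, here is a minimal sketch of how an evaluation harness might compute "percent of tests passed, averaged per task"; the names and structure below are hypothetical illustrations, not taken from the ProgramBench paper:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical structure)."""
    tests_passed: int
    tests_total: int

def average_pass_rate(results: list[TaskResult]) -> float:
    """Mean fraction of tests passed, averaged per task.

    A task with 0/120 passing tests contributes 0.0; a fully
    solved task contributes 1.0. The reported "3% of tests on
    average" would correspond to a return value of ~0.03.
    """
    per_task = [r.tests_passed / r.tests_total
                for r in results if r.tests_total > 0]
    return sum(per_task) / len(per_task) if per_task else 0.0

# Example: three tasks, none fully solved.
results = [TaskResult(2, 50), TaskResult(0, 120), TaskResult(5, 80)]
print(f"{average_pass_rate(results):.1%}")  # prints 3.4%
```

Under this kind of per-task averaging, a model can score a few percent overall while fully completing zero tasks, which matches the result described above.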
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT Highlights the current limitations of LLMs on complex, end-to-end software engineering tasks, suggesting that fully autonomous code generation remains an open research problem.
RANK_REASON This is a research paper introducing a new benchmark for evaluating language models in software development.