Researchers have introduced ProgramBench, a new benchmark designed to evaluate the holistic software development capabilities of language models. The benchmark challenges AI agents to architect and implement entire codebases from scratch, given only a program's documentation. Across 200 tasks, including implementing software like FFmpeg and SQLite, none of the nine evaluated language models could fully complete any task, with the best model passing only 3% of tests on average.
Impact: Highlights current limitations of LLMs in complex software engineering tasks, suggesting further research is needed for autonomous code generation.
Ranking rationale: This is a research paper introducing a new benchmark for evaluating language models in software development.
AI-generated summary · Google Gemini · From 2 sources.