PulseAugur
ProgramBench benchmark finds language models struggle to build software from scratch

Researchers have introduced ProgramBench, a benchmark that evaluates how well language models handle software development end to end. It challenges AI agents to architect and implement entire codebases from scratch, given only a program's documentation. Across 200 tasks, including reimplementing software such as FFmpeg and SQLite, none of the nine evaluated language models fully completed any task, and the best model passed only 3% of tests on average.

Impact: Highlights the current limitations of LLMs on complex software engineering tasks and suggests that autonomous code generation needs further research.

Ranking rationale: This is a research paper introducing a new benchmark for evaluating language models on software development.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

  1. arXiv cs.AI TIER_1 English(EN) · John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, Ofir Press

    ProgramBench: Can Language Models Rebuild Programs From Scratch?

    arXiv:2605.03546v1 · Announce Type: cross · Abstract: Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such set…

  2. arXiv cs.AI TIER_1 English(EN) · Ofir Press

    ProgramBench: Can Language Models Rebuild Programs From Scratch?

    Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software a…