New study evaluates how well LLMs understand software execution, beyond writing code

By the PulseAugur editorial team · [1 source] · 2026-06-29 04:00

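A new paper introduces a method for evaluating the implicit software world models of coding LLMs. The method goes beyond simple control flow to assess resource usage such as memory and execution time. Using data from SWE-bench Verified, the study finds that even advanced models show a limited understanding of software execution, which suggests that their ability to reason about program behavior lags behind their ability to write code.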

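Impact: This study highlights the limits of current LLMs' understanding of software execution and points to the need for better evaluation methods that go beyond code generation.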

Ranking rationale: This cluster contains an academic paper detailing a new evaluation method for LLMs.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI TIER_1 English(EN) · Egor Bogomolov, Yaroslav Zharov · 2026-06-29 04:00

Towards Evaluation of Implicit Software World Models in Coding LLMs

arXiv:2606.27406v1 Announce Type: cross Abstract: Software engineering, whether performed by humans or by AI agents, requires reasoning about how software behaves. We call the internal model that supports such reasoning the software world model, and view current code-execution be…