VeriScale: Adversarial Test-Suite Scaling for Verifiable Code Generation

New VeriScale framework strengthens benchmarking of LLM code generation

By the PulseAugur editorial team · [1 source] · 2026-05-22 04:00

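Researchers have developed VeriScale, a new framework for building more robust benchmarks to evaluate code generated by large language models. The framework uses an adversarial approach to expand and then reduce test suites, exposing model weaknesses that simpler benchmarks can miss. In experiments applying VeriScale to the Verina benchmark, state-of-the-art LLMs showed a significant drop in performance, highlighting the limits of current evaluation methods.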
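The summary does not say how VeriScale carries out the expand-then-reduce step, so the following is only a generic illustration of the idea, not the paper's method. It assumes a toy task (absolute value), three hand-written faulty variants standing in for plausible but wrong model outputs, and a greedy set-cover reduction; every name in it is hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set

Impl = Callable[[int], int]


def reference(x: int) -> int:
    """Trusted implementation of the toy task (assumption: absolute value)."""
    return abs(x)


# Hypothetical faulty variants; each is wrong on a different part of the input space.
MUTANTS: Dict[str, Impl] = {
    "identity": lambda x: x,                         # wrong for x < 0
    "negate": lambda x: -x,                          # wrong for x > 0
    "off_at_zero": lambda x: abs(x) if x != 0 else 1,  # wrong only at x == 0
}


def expand(candidates: Iterable[int], mutants: Dict[str, Impl]) -> Dict[int, Set[str]]:
    """Expansion: keep every candidate input that exposes at least one faulty variant."""
    kills: Dict[int, Set[str]] = {}
    for x in candidates:
        killed = {name for name, m in mutants.items() if m(x) != reference(x)}
        if killed:
            kills[x] = killed
    return kills


def reduce_suite(kills: Dict[int, Set[str]]) -> List[int]:
    """Reduction: greedy set cover that still exposes every variant the full suite exposed."""
    remaining: Set[str] = set().union(*kills.values())
    chosen: List[int] = []
    while remaining:
        # Pick the input that exposes the most not-yet-covered variants.
        best = max(kills, key=lambda x: len(kills[x] & remaining))
        chosen.append(best)
        remaining -= kills[best]
    return chosen
```

Calling `reduce_suite(expand(range(-3, 4), MUTANTS))` on this toy setup keeps all seven inputs during expansion and then retains three of them, one per faulty variant. A suite that only tested positive inputs would have missed two of the three variants, which is the kind of gap the article attributes to simpler benchmarks.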

Impact: Makes the evaluation of LLM-generated code more rigorous, which may lead to more reliable software development tools.

Ranking rationale: This cluster contains one academic paper describing a new framework for evaluating LLM code generation.


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

arXiv cs.LG · Yifan Bai, Xiaoyang Liu, Zihao Mou, Guihong Wang, Jian Yu, Shuhan Xie, Yantao Li, Yangyu Zhang, Jingwei Liang, Tao Luo · 2026-05-22 04:00

VeriScale: Adversarial Test-Suite Scaling for Verifiable Code Generation

arXiv:2605.22368v1 · Abstract: As large language models (LLMs) are increasingly deployed for software engineering, constructing high-quality benchmarks is crucial for evaluating not just the functional correctness, but also the formal verifiability of generated c…

