CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

New Benchmark CORE-Bench Tests AI Agents on Scientific Reproducibility

By the PulseAugur editorial team · [1 source] · 2026-06-24 04:00

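Researchers have introduced CORE-Bench, a new benchmark for evaluating how well AI agents perform computational reproducibility tasks. It comprises 270 tasks derived from 90 scientific papers in computer science, social science, and medicine, spans several difficulty levels, and includes both language-only and vision-language challenges. The authors built an evaluation system to speed up assessment and used it to test baseline agents, including AutoGPT and a purpose-built CORE-Agent, with the GPT-4o and GPT-4o-mini models. The best-performing agent reached 21% accuracy on the hardest tasks, indicating substantial room for improvement in automating the scientific research process.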

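Impact: The benchmark aims to improve the capabilities of AI agents in scientific research and could accelerate discovery and verification.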

Ranking rationale: This cluster contains one research paper introducing a new benchmark for AI agents.

AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

arXiv cs.AI · Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, Arvind Narayanan · 2026-06-24 04:00

CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

arXiv:2409.11363v2 · Abstract: AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially…
