New benchmark CORE-Bench tests AI agents' scientific reproducibility

By PulseAugur Editorial · [1 source] · 2026-06-24 04:00

Researchers have introduced CORE-Bench, a benchmark that evaluates how well AI agents can computationally reproduce published research. It comprises 270 tasks derived from 90 scientific papers across computer science, social science, and medicine, spanning three difficulty levels and including both language-only and vision-language tasks. The authors also provide an evaluation system that measures agent accuracy in parallel, which shortens assessment time. Two baseline agents, the general-purpose AutoGPT and a task-specific CORE-Agent, were tested with GPT-4o and GPT-4o-mini. The best-performing agent achieved 21% accuracy on the most difficult tasks, leaving substantial room for improvement in automating routine scientific work.

IMPACT This benchmark aims to advance AI agents' capabilities in scientific research, potentially accelerating discovery and verification processes.

RANK_REASON The cluster contains a research paper introducing a new benchmark for AI agents.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, Arvind Narayanan · 2026-06-24 04:00

CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

arXiv:2409.11363v2 Announce Type: replace-cross Abstract: AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially…
