Researchers have introduced CORE-Bench, a new benchmark that evaluates how well AI agents perform computational reproducibility tasks. The benchmark comprises 270 tasks derived from 90 scientific papers across computer science, social science, and medicine. The tasks vary in difficulty and include both language-only and vision-language challenges. The researchers also developed an evaluation system to speed up assessment, and tested two baseline agents, AutoGPT and a specialized CORE-Agent, using the GPT-4o and GPT-4o-mini models. The best-performing agent achieved 21% accuracy on the most difficult tasks, leaving significant room for improvement in automating scientific research processes.
AI IMPACT This benchmark aims to advance AI agents' capabilities in scientific research, potentially accelerating discovery and verification processes.
RANK_REASON The cluster contains a research paper introducing a new benchmark for AI agents.
- AI agents
- AutoGPT
- computational reproducibility
- computer science
- CORE-Agent
- CORE-Bench
- GPT-4o
- GPT-4o-mini
- medicine
- social science
AI-generated summary · Google Gemini · from 1 source.