AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research
Researchers have developed AblationBench, a benchmark suite that evaluates how well AI agents can plan ablation experiments in empirical AI research. The benchmark includes two tasks: an author task, which proposes ablations based on a paper's method section, and a reviewer task, which identifies missing ablations in a full paper. Current frontier language models struggle with both tasks and fall short of human-level performance, with the best models identifying only about 45% of the necessary ablations.
AI IMPACT: This benchmark could drive improvements in AI's ability to assist scientific research by identifying gaps in experimental design.