AI Evaluation Should Require Standardized Item-Level Data Releases
A new position paper advocates for standardized item-level data releases in AI evaluations to improve transparency and replicability. The authors argue that current aggregate scores obscure critical issues like underspecified item selection and construct misalignment, leading to inflated capability claims and misplaced trust. To address this, they propose treating item-level data as core infrastructure and introduce OpenEval, an archive of 10 million responses across numerous benchmarks, designed to facilitate deeper analysis and validation of AI evaluations.
IMPACT: Standardizing AI evaluation data could lead to more trustworthy benchmark results and better-informed decisions about deployed systems.