A new position paper advocates for standardized item-level data releases in AI evaluations to improve transparency and replicability. The authors argue that current aggregate scores obscure critical issues like underspecified item selection and construct misalignment, leading to inflated capability claims and misplaced trust. To address this, they propose treating item-level data as core infrastructure and introduce OpenEval, an archive of 10 million responses across numerous benchmarks, designed to facilitate deeper analysis and validation of AI evaluations.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Standardizing AI evaluation data could lead to more trustworthy benchmark results and better-informed decisions about deployed systems.
RANK_REASON The cluster contains a research paper proposing a new methodology for AI evaluation.