Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations
A new paper proposes a Bayesian inference framework to audit public archives of frontier AI evaluations. The research highlights how selective reporting and benchmark revisions can distort the perception of AI progress, using LiveBench and Open LLM Leaderboard v2 as primary examples. The proposed archive-and-adjudication protocol aims to reconstruct evaluation histories, establish verified timing boundaries, and invalidate unsubstantiated claims about AI capabilities.
IMPACT: Proposes a new framework for auditing AI evaluation data, potentially improving the transparency and reliability of benchmark results.