Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 2d

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

A new research paper identifies significant weaknesses in current methods for detecting benchmark contamination in large language models. The study evaluated 27 models, including frontier industry models, and found that common statistical detection tools fail under realistic conditions such as distribution shift and scale differences between benchmarks and training data. The tools produced incorrect outcomes in over 40% of evaluations, which indicates that they are unreliable for practical benchmark auditing and cannot replace transparent data provenance.

IMPACT Current methods for detecting benchmark contamination are unreliable, so valid LLM evaluation will require new approaches.

Pythia
OLMo
Wojciech Zarzecki