AI benchmark contamination signal sensitive to question format, study finds

By PulseAugur Editorial · [1 source] · 2026-04-28 04:00

A new paper questions how reliable post-cutoff performance decay is as a temporal signal of benchmark contamination in large language models. The researchers found that whether performance decay appears over time depends heavily on how the benchmark questions are constructed. LLM-generated questions can mask contamination that is evident with simpler question formats, which suggests that detection methods relying on this signal alone may be insufficient for accurate AI evaluation.
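For readers unfamiliar with the temporal signal, the idea can be sketched as a comparison of accuracy on questions dated before and after a model's training cutoff. The snippet below is a minimal illustration, not the paper's method: the function name `decay_gap`, the record layout, and the toy data are all hypothetical placeholders.

```python
from datetime import date


def decay_gap(records, cutoff):
    """Compare accuracy before and after a training cutoff.

    records: list of (question_date, is_correct) pairs (hypothetical layout).
    Returns (pre_accuracy, post_accuracy, gap), where gap = pre - post.
    A large positive gap is what is commonly read as a contamination signal.
    """
    pre = [ok for d, ok in records if d <= cutoff]
    post = [ok for d, ok in records if d > cutoff]
    if not pre or not post:
        raise ValueError("need at least one question on each side of the cutoff")
    pre_acc = sum(pre) / len(pre)
    post_acc = sum(post) / len(post)
    return pre_acc, post_acc, pre_acc - post_acc


# Toy, made-up records for illustration only (not results from the paper).
toy_records = [
    (date(2024, 1, 1), True),
    (date(2024, 2, 1), True),
    (date(2024, 6, 1), True),
    (date(2024, 7, 1), False),
]
print(decay_gap(toy_records, cutoff=date(2024, 3, 1)))
```

Running the same comparison separately for each question format is one way to see the sensitivity the paper describes: if the gap appears under one format and vanishes under another, the signal depends on the format rather than on contamination alone.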

IMPACT Highlights the need for more robust methods to detect benchmark contamination, crucial for reliable AI evaluation.

RANK_REASON Academic paper analyzing AI evaluation methodology.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Terry Jingchen Zhang, Gopal Dev, Ning Wang, Max Obreiter, Punya Syon Pandey, Keenan Samway, Wenyuan Jiang, Yinya Huang, Bernhard Schölkopf, Mrinmaya Sachan, Zhijing Jin · 2026-04-28 04:00

Test of Time: Rethinking Temporal Signal of Benchmark Contamination

arXiv:2509.00072v3 Announce Type: replace Abstract: Post-cutoff performance decay has been widely interpreted as a temporal signal for benchmark contamination. We critically examine this belief and demonstrate that this temporal signal is highly sensitive to how benchmark questio…
