PulseAugur

AI benchmark contamination signal sensitive to question format, study finds

A new paper questions the reliability of temporal signals for detecting benchmark contamination in large language models. The researchers found that how benchmark questions are phrased strongly affects whether performance decay appears after a model's training cutoff. LLM-generated questions can mask contamination that is evident with simpler question formats, suggesting that current detection methods may be insufficient for accurate AI evaluation.
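To make the idea of a temporal signal concrete, here is a minimal sketch of how a post-cutoff decay gap could be computed separately for each question format. It is not the paper's method: the function name `decay_by_format`, the format labels, the cutoff date, and every record in `toy_records` are made up for illustration.

```python
from datetime import date
from statistics import mean


def decay_by_format(records, cutoff):
    """Return {format: pre-cutoff accuracy minus post-cutoff accuracy}.

    records: iterable of (question_format, source_date, correct) tuples,
    where correct is 1 or 0. A large positive gap is what is usually read
    as a contamination signal.
    """
    buckets = {}
    for fmt, published, correct in records:
        period = "post" if published > cutoff else "pre"
        buckets.setdefault(fmt, {"pre": [], "post": []})[period].append(correct)
    # Only report formats that have questions on both sides of the cutoff.
    return {
        fmt: mean(b["pre"]) - mean(b["post"])
        for fmt, b in buckets.items()
        if b["pre"] and b["post"]
    }


# Hypothetical toy data, not results from the paper.
toy_records = [
    ("simple", date(2023, 1, 10), 1),
    ("simple", date(2023, 3, 5), 1),
    ("simple", date(2024, 6, 1), 0),
    ("simple", date(2024, 8, 20), 1),
    ("llm_generated", date(2023, 2, 1), 1),
    ("llm_generated", date(2023, 4, 9), 0),
    ("llm_generated", date(2024, 7, 15), 1),
    ("llm_generated", date(2024, 9, 30), 0),
]

gaps = decay_by_format(toy_records, cutoff=date(2023, 12, 31))
for fmt, gap in gaps.items():
    print(f"{fmt}: decay gap = {gap:.2f}")
```

In this toy data the same questions grouped by format give different gaps, which is the kind of format sensitivity the summary describes. A real analysis would need many more questions per bucket and some estimate of statistical uncertainty.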

Impact: Highlights the need for more robust methods to detect benchmark contamination, which is crucial for reliable AI evaluation.

Ranking rationale: Academic paper analyzing AI evaluation methodology.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 1 source. How we write summaries →


Sources [1]

  1. arXiv cs.AI TIER_1 English (EN) · Terry Jingchen Zhang, Gopal Dev, Ning Wang, Max Obreiter, Punya Syon Pandey, Keenan Samway, Wenyuan Jiang, Yinya Huang, Bernhard Schölkopf, Mrinmaya Sachan, Zhijing Jin

    Test of Time: Rethinking Temporal Signal of Benchmark Contamination

    arXiv:2509.00072v3 · Abstract: Post-cutoff performance decay has been widely interpreted as a temporal signal for benchmark contamination. We critically examine this belief and demonstrate that this temporal signal is highly sensitive to how benchmark questio…