A new paper questions the reliability of temporal signals in detecting benchmark contamination for large language models. Researchers found that the way benchmark questions are phrased significantly impacts whether performance decay appears over time. LLM-generated questions can mask contamination that is evident with simpler question formats, suggesting current detection methods may be insufficient for accurate AI evaluation.
Impact: Highlights the need for more robust methods to detect benchmark contamination, which is crucial for reliable AI evaluation.
Ranking rationale: Academic paper analyzing AI evaluation methodology.
AI-generated summary · Google Gemini · From 1 source.