A new research paper identifies a problem called Search-Time Contamination (STC) in the evaluation of deep research agents that use web search. The contamination occurs when an agent retrieves benchmark metadata, question context, or answers from the web during evaluation, which artificially inflates its measured performance. The study found that STC can inflate performance by up to 4% and advocates contamination-aware evaluation practices such as isolated sandboxes and controlled benchmark access.
AI IMPACT Highlights potential overestimation of AI reasoning abilities, necessitating new evaluation standards for research agents.
RANK_REASON The cluster contains an academic paper detailing a new research finding about AI evaluation.
AI-generated summary · Google Gemini · from 1 source.