Search-Time Contamination in Deep Research Agents: Measuring Performance Inflation in Public Benchmark Evaluation
A new research paper identifies a problem called Search-Time Contamination (STC) in deep research agents that use web search while being evaluated. The contamination occurs when an agent retrieves benchmark metadata, question context, or answers from the web, which artificially inflates its measured performance. The study found that STC can inflate performance by up to 4% and advocates contamination-aware evaluation practices such as isolated sandboxes and controlled benchmark access.
AI IMPACT: Highlights potential overestimation of AI reasoning abilities, necessitating new evaluation standards for research agents.