Search-Time Contamination in Deep Research Agents: Measuring Performance Inflation in Public Benchmark Evaluation

Research paper warns that "search-time contamination" can inflate AI agent benchmark results

By the PulseAugur editorial team · [1 source] · 2026-06-06 04:00

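A new research paper identifies a problem in deep research agents that use web search during evaluation, called search-time contamination (STC). It occurs when an agent retrieves benchmark metadata, question context, or answers from the web, which artificially inflates its measured performance. The study finds that STC can inflate performance by up to 4% and advocates contamination-resistant evaluation practices such as isolated sandboxes and controlled benchmark access.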
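To make the idea of controlled benchmark access concrete, here is a minimal Python sketch of one possible safeguard: dropping search results that point at sites assumed to host benchmark material, and comparing accuracy with and without that filter. This is not the paper's method. The domain names, the function names (`is_blocked`, `filter_results`, `inflation`), and the choice to express inflation as a simple accuracy difference are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical blocklist: placeholder domains assumed to host benchmark
# questions, metadata, or answers. Real deployments would need their own list.
BLOCKED_DOMAINS = {"benchmark-host.example", "leaderboard.example"}


def is_blocked(url: str, blocked: set[str] = BLOCKED_DOMAINS) -> bool:
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)


def filter_results(urls: list[str]) -> list[str]:
    """Drop search results that point at blocked domains, preserving order."""
    return [u for u in urls if not is_blocked(u)]


def inflation(acc_open: float, acc_filtered: float) -> float:
    """Accuracy gap between unrestricted and filtered search (assumed metric)."""
    return acc_open - acc_filtered
```

An evaluation harness could pass each batch of search results through `filter_results` before the agent sees them, then report `inflation` between the unrestricted and filtered runs. A domain blocklist alone is a weak defense, since benchmark content can be mirrored elsewhere, which is presumably why stronger isolation such as sandboxing is recommended.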

Impact: AI reasoning ability may be overestimated, and research agents need new evaluation standards.



AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI · Yongjie Wang, Xinyue Zhang, Kunhong Yao, Zhiwei Zeng, Kaisong Song, Jun Lin, Zhiqi Shen · 2026-06-06 04:00

Search-Time Contamination in Deep Research Agents: Measuring Performance Inflation in Public Benchmark Evaluation

arXiv:2606.05241v1 Announce Type: cross Abstract: Public benchmarks enable fair and reproducible evaluation of LLM reasoning, but they become fragile for deep research agents that actively search the web during inference. Such agents may retrieve public benchmark metadata, questi…

