When Web Agents Finish but Still Fail: Reproducible Triggers and Trace Diagnostics for Parallel Web Exploration

New Benchmark Reveals Hidden Failure Modes in Web Agents

By the PulseAugur editorial team · [1 source] · 2026-06-30 04:00

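A new arXiv paper introduces Parallel WebBench, a benchmark designed to evaluate web agents more rigorously by identifying failures that final-answer correctness does not capture. The study finds that persistent problems remain even when an agent retrieves relevant evidence: search loops, premature termination, and synthesis collapse. Training with GRPO and synthetic data improved completion rates and partial correctness, but a gap remains in ensuring that final answers are fully correct and grounded in evidence.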

Impact: Highlights key areas where web agent reliability and evaluation methods need to improve.

Ranking rationale: A research paper published on arXiv that details a new benchmark and an analysis of web agent failures.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI · Aagam Sogani, Botao Rui, Swetha Vaidyanathan, Rishi Agarwal, Minghao Yan, Shivaram Venkataraman · 2026-06-30 04:00

When Web Agents Finish but Still Fail: Reproducible Triggers and Trace Diagnostics for Parallel Web Exploration

arXiv:2606.20724v2. Abstract: Long-horizon web agents often fail in ways hidden by final-answer evaluation: they may visit useful pages, produce a well-formed answer, and terminate confidently while still missing fields, over-including unsupported items, or …

