A new arXiv paper introduces Parallel WebBench, a benchmark designed to evaluate web agents more rigorously by identifying failures beyond final-answer correctness. The study reveals persistent issues such as search loops, premature termination, and synthesis collapse, even when agents retrieve relevant evidence. While training with GRPO and synthetic data improved completion rates and partial correctness, a gap remains in ensuring that the final answer is fully correct and grounded in the evidence.
AI IMPACT Highlights critical areas for improvement in web agent reliability and evaluation methodologies.
RANK_REASON Research paper published on arXiv detailing a new benchmark and analysis of web agent failures.
AI-generated summary · Google Gemini · from 1 source.