Ko-WideSearch is a new benchmark for evaluating the breadth-search capabilities of web agents, focusing on exhaustive set enumeration rather than depth-based question answering. The Korean-language benchmark, constructed via an automated pipeline, comprises 228 tables across 190 entities and sixteen categories. Initial testing with twenty web agents revealed consistent failures to recover row-level attributes accurately, even when overall set membership was correctly identified, indicating that accurate attribute extraction remains a significant challenge for current AI systems.
AI IMPACT Highlights a critical gap in web agent capabilities: current models need improvement in structured data extraction and exhaustive enumeration.
RANK_REASON The cluster describes a new academic benchmark for evaluating AI capabilities, published on arXiv.
AI-generated summary · Google Gemini · from 3 sources.