PulseAugur

New Ko-WideSearch benchmark reveals web agents struggle with breadth-search tasks

Ko-WideSearch is a new Korean-language benchmark that evaluates the breadth-search capabilities of web agents: exhaustively enumerating a closed set and filling in each item's attributes, rather than answering depth-based questions. Built with an automated pipeline, it comprises 228 tables across 190 entities and sixteen categories. In initial tests of twenty web agents, the agents consistently failed to recover row-level attributes accurately even when they identified set membership correctly, a sign that exhaustive structured extraction remains difficult for current systems.
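The gap between set membership and row-level recovery can be illustrated with a minimal sketch. It assumes a table is a mapping from item key to an attribute dictionary. The functions `item_f1` and `row_accuracy` and the toy data are hypothetical illustrations, not the metrics or data defined by the Ko-WideSearch paper.

```python
from typing import Dict

# Assumed representation: item key -> {attribute name: value}
Table = Dict[str, Dict[str, str]]


def item_f1(gold: Table, pred: Table) -> float:
    """F1 over item keys only: did the agent find the right set members?"""
    hits = len(gold.keys() & pred.keys())
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def row_accuracy(gold: Table, pred: Table) -> float:
    """Fraction of gold rows whose attributes are all reproduced exactly."""
    if not gold:
        return 0.0
    correct = sum(1 for key, row in gold.items() if pred.get(key) == row)
    return correct / len(gold)


# Toy data, invented for illustration only.
gold = {
    "item_a": {"year": "2019", "city": "Seoul"},
    "item_b": {"year": "2021", "city": "Busan"},
}
pred = {
    "item_a": {"year": "2019", "city": "Seoul"},
    "item_b": {"year": "2020", "city": "Busan"},  # right member, wrong attribute
}

print(item_f1(gold, pred))       # 1.0
print(row_accuracy(gold, pred))  # 0.5
```

In this toy case the agent finds every member of the set, yet only half of the rows are fully correct, which mirrors the failure pattern the summary describes.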

IMPACT Highlights a critical gap in web agent capabilities, suggesting current models need improvement in structured data extraction and exhaustive enumeration.

RANK_REASON The cluster describes a new academic benchmark for evaluating AI capabilities, published on arXiv.


AI-generated summary · Google Gemini · from 3 sources.


COVERAGE [3]

  1. arXiv cs.CL TIER_1 English(EN) · Minbyul Jeong

    Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

    arXiv:2606.27595v1. Abstract: Web-agent benchmarks overwhelmingly measure depth -- pinning one obscure answer behind a chain of constraints -- while breadth, exhaustively enumerating a closed set and filling each item's attributes, is barely evaluated, especiall…

  2. arXiv cs.CL TIER_1 English(EN) · Minbyul Jeong

    Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

    Web-agent benchmarks overwhelmingly measure depth -- pinning one obscure answer behind a chain of constraints -- while breadth, exhaustively enumerating a closed set and filling each item's attributes, is barely evaluated, especially outside English. Breadth is also hard to build…

  3. Hugging Face Daily Papers TIER_1 English(EN)

    Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

    A Korean web-agent benchmark evaluates breadth-search capability by requiring agents to enumerate a set's members completely and fill in an attribute table for each, revealing consistent failures in row recovery despite accurate set identification.