PulseAugur

New benchmark DiscoBench evaluates LLM search agents' ability to handle ambiguous queries

Researchers have introduced DiscoBench, a benchmark for evaluating how well search agents powered by large language models (LLMs) handle ambiguous queries. It comprises 211 samples and 463 ambiguity instances across 11 domains and tests whether agents identify vagueness, ask clarifying questions, and recover from incorrect search paths. Experiments indicate that ambiguity detection and clarification are distinct skills, and that agents often perform worse because they keep searching instead of asking for clarification, which points to a gap in interactive problem-solving.

IMPACT This benchmark could drive improvements in LLM search agents, making them more effective at handling real-world, ambiguous user queries.

RANK_REASON The cluster describes a new benchmark for evaluating LLM search agents, which is a research contribution.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.CL TIER_1 English(EN) · Yiling Tao, Shihan Deng, Meiling Tao, Pengzhi Wei, Zhichao Hu, Zhihao Zhu

    When Search Agents Should Ask: DiscoBench for Clarification-Aware Deep Search

    arXiv:2606.27669v1. Abstract: Search agents powered by large language models (LLMs) are increasingly used to solve complex information-seeking tasks, requiring multi-step retrieval and reasoning to fulfill user goals. However, existing benchmarks often assume th…

  2. arXiv cs.CL TIER_1 English(EN) · Zhihao Zhu

    When Search Agents Should Ask: DiscoBench for Clarification-Aware Deep Search

    Search agents powered by large language models (LLMs) are increasingly used to solve complex information-seeking tasks, requiring multi-step retrieval and reasoning to fulfill user goals. However, existing benchmarks often assume that user queries are complete and explicit, overl…