LLM agents know when to use tools, but fail to act on it

By PulseAugur Editorial · [1 source] · 2026-05-22 04:00

Researchers have introduced When2Tool, a benchmark for evaluating when Large Language Model (LLM) agents actually need to call external tools. The benchmark shows that LLMs encode whether a tool is necessary in their hidden states, yet fail to act on that knowledge during generation. The proposed method, Probe&Prefill, uses this internal signal to significantly reduce unnecessary tool calls with minimal accuracy loss, outperforming existing baselines.
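The summary does not say how Probe&Prefill is implemented, so the following is only a rough sketch of the general idea its name suggests: read a tool-necessity score from a hidden state with a small linear probe, then prefill the start of the response to match that score. Everything here is an assumption made for illustration, including the toy 2-dimensional "hidden states", the function names `train_probe`, `probe_score`, and `choose_prefill`, the prefill strings, and the 0.5 threshold. None of it is taken from the paper.

```python
import math

def train_probe(states, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe: hidden-state vector -> P(tool needed).

    Hypothetical helper; plain stochastic gradient descent on log loss.
    """
    w = [0.0] * len(states[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(states, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def probe_score(w, b, x):
    """Return the probe's estimated probability that a tool call is needed."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def choose_prefill(score, threshold=0.5):
    """Pick a response prefix that commits generation to the probe's decision.

    The strings and threshold are illustrative assumptions.
    """
    if score >= threshold:
        return "I need to call a tool for this."
    return "I can answer this directly."

# Toy stand-ins for hidden states (assumed data, not real model activations).
states = [[1.0, 0.2], [0.9, -0.1], [-1.0, 0.1], [-0.8, -0.3]]
labels = [1, 1, 0, 0]  # 1 = tool needed, 0 = answer directly
w, b = train_probe(states, labels)

needs_tool = choose_prefill(probe_score(w, b, [1.0, 0.0]))
no_tool = choose_prefill(probe_score(w, b, [-1.0, 0.0]))
```

In a real agent, the vectors would come from a chosen layer of the model and the selected prefix would be placed at the start of the assistant turn before decoding continues. Which layer, which prefix, and which threshold to use are open design choices that this sketch does not settle.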

IMPACT Improves LLM agent efficiency by reducing unnecessary tool calls, potentially lowering costs and latency for AI applications.

RANK_REASON The cluster contains an academic paper proposing a new benchmark and method for evaluating LLM agent tool usage.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CL · TIER_1 · English (EN) · Chung-En Sun, Linbo Liu, Ge Yan, Zimo Wang, Tsui-Wei Weng · 2026-05-22 04:00

LLM Agents Already Know When to Call Tools -- Even Without Reasoning

arXiv:2605.09252v2. Abstract: Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actu…
