Researchers have developed a new benchmark called When2Tool to evaluate when Large Language Model (LLM) agents should use external tools. The benchmark reveals that LLMs possess an internal understanding of tool necessity, detectable in their hidden states, but fail to act on this knowledge during generation. A proposed method, Probe&Prefill, leverages this internal signal to significantly reduce unnecessary tool calls with minimal accuracy loss, outperforming existing baselines.
AI IMPACT Improves LLM agent efficiency by reducing unnecessary tool calls, potentially lowering costs and latency for AI applications.
RANK_REASON The cluster contains an academic paper proposing a new benchmark and method for evaluating LLM agent tool usage.
AI-generated summary · Google Gemini · from 1 source.