PulseAugur

MCP-Atlas benchmark tests LLM tool-use competency with real servers

Researchers have introduced MCP-Atlas, a benchmark for evaluating the tool-use capabilities of large language models. It comprises 36 real MCP servers and 220 tools, with 1,000 tasks that require multi-step workflows and the orchestration of multiple tool calls. Initial evaluations of advanced models show that top performers exceed a 50% pass rate, while common failures stem from errors in tool usage and task comprehension.

Impact: Establishes a new standard for evaluating LLM tool use, potentially driving improvements in agentic capabilities and real-world application integration.

Ranking rationale: Introduction of a new benchmark dataset for evaluating LLM tool-use competency.


AI-generated summary · Google Gemini · from 1 source.


Sources [1]

  1. arXiv cs.AI · Tier 1 · English (EN) · Chaithanya Bandi, Ben Hertzberg, Geobio Boo, Tejas Polakam, Jeff Da, Sami Hassaan, Manasi Sharma, Andrew Park, Ernesto Hernandez, Dan Rambado, Ivan Salazar, Rafael Cruz, Chetan Rane, Ben Levin, Brad Kenstler, Bing Liu

    MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

    arXiv:2602.00933v2 · Announce Type: replace-cross · Abstract: The Model Context Protocol (MCP) is rapidly becoming the standard interface for Large Language Models (LLMs) to discover and invoke external tools. However, existing evaluations often fail to capture the complexity of real…