PulseAugur

New benchmark reveals AI agents struggle with real-world SaaS tasks

Researchers have introduced SaaS-Bench, a benchmark for evaluating computer-using agents (CUAs) on realistic professional workflows in Software-as-a-Service (SaaS) environments. It comprises 106 tasks across 23 SaaS systems in six professional domains; the tasks require long-horizon execution and cover both text-only and multimodal scenarios. Initial experiments show that current LLM-based agents perform poorly: the best models complete fewer than 4% of tasks end-to-end, which points to significant limitations in planning, state tracking, and error recovery.

Impact: Highlights the gap between current AI agent capabilities and the demands of real-world professional tasks, indicating a need for advances in planning and context maintenance.

Ranking rationale: The cluster describes the release of a new academic paper introducing a benchmark for AI agents.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

  1. arXiv cs.AI · TIER_1 · English (EN) · Baobao Chang

    SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

    Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely o…