PulseAugur

New benchmark reveals AI agents struggle with real-world SaaS tasks

Researchers have introduced SaaS-Bench, a benchmark that evaluates computer-using agents (CUAs) on realistic professional workflows. It spans 23 Software-as-a-Service (SaaS) systems across six domains and comprises 106 tasks that require long-horizon execution and can be text-only or multimodal. In initial experiments, current LLM-based agents complete fewer than 4% of tasks end-to-end, exposing significant limitations in planning, state tracking, and cross-application context maintenance.

IMPACT Highlights limitations in current AI agents for professional workflows, suggesting a need for improved planning and context management in real-world applications.

RANK_REASON The cluster contains a research paper introducing a new benchmark for evaluating AI agents.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.AI TIER_1 English(EN) · Kean Shi, Zihang Li, Tianyi Ma, Zengji Tu, Jialong Wu, Xinbo Xu, Qingyao Yang, Ruoyu Wu, Weichu Xie, Ming Wu, Jason Zeng, Michael Heinrich, Elvis Zhang, Liang Chen, Kuan Li, Baobao Chang

    SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

    arXiv:2605.15777v2 Announce Type: replace Abstract: Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However…