Researchers have introduced SaaS-Bench, a new benchmark for evaluating computer-using agents (CUAs) on realistic professional workflows. The benchmark uses 23 Software-as-a-Service (SaaS) systems across six domains and features 106 tasks, either text-only or multimodal, that require long-horizon execution. Initial experiments show that current LLM-based agents perform poorly, completing less than 4% of tasks end-to-end, which points to significant limitations in planning, state tracking, and cross-application context maintenance.
AI IMPACT Highlights limitations of current AI agents in professional workflows, suggesting a need for better planning and context management in real-world applications.
RANK_REASON The cluster contains a research paper introducing a new benchmark for evaluating AI agents.
AI-generated summary · Google Gemini · from 1 source.