Researchers have introduced SaaS-Bench, a new benchmark designed to evaluate computer-using agents (CUAs) on realistic professional workflows within Software-as-a-Service (SaaS) environments. The benchmark comprises 106 tasks across 23 SaaS systems in six professional domains, requiring long-horizon execution and covering both text-only and multimodal scenarios. Initial experiments reveal that current LLM-based agents perform poorly, with the best models completing fewer than 4% of tasks end-to-end, highlighting significant limitations in planning, state tracking, and error recovery.
AI IMPACT Highlights the gap between current AI agent capabilities and the demands of real-world professional tasks, indicating a need for advancements in planning and context maintenance.
RANK_REASON The cluster describes the release of a new academic paper introducing a benchmark for AI agents.
AI-generated summary · Google Gemini · from 1 source.