Researchers have introduced SaaS-Bench, a new benchmark designed to evaluate computer-using agents (CUAs) on realistic professional workflows within Software-as-a-Service (SaaS) environments. The benchmark comprises 106 tasks across 23 SaaS systems in six professional domains, requiring long-horizon execution and covering both text-only and multimodal scenarios. Initial experiments reveal that current LLM-based agents perform poorly: the best models complete fewer than 4% of tasks end-to-end, which points to significant limitations in planning, state tracking, and error recovery.
Impact: Highlights the gap between current AI agent capabilities and the demands of real-world professional tasks, indicating a need for advances in planning and context maintenance.
Ranking rationale: The cluster describes the release of a new academic paper introducing a benchmark for AI agents.
AI-generated summary · Google Gemini · from 1 source.