PulseAugur
tool · [1 source] · 中文(ZH) Claude's Pass Rate Under 4%, SaaS-Bench Tears Apart Computer-Use's 'Fully Automated Office' Fantasy

AI agents fail real-world tasks, new SaaS-Bench reveals

A new benchmark, SaaS-Bench, shows that current AI agents struggle with real-world, long-horizon tasks: even top models such as Claude Opus 4.7 fully complete fewer than 4% of tasks. The benchmark uses actual SaaS systems and data and exposes four key failure modes: performance that degrades over extended tasks, single mistakes that cascade into further errors, a lack of self-checking mechanisms, and inconsistent results across repeated runs. The findings suggest that the current agent paradigm is insufficient for true automation, and that software interfaces may need to be redesigned for AI agents rather than expecting agents to operate human-centric interfaces.

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT Reveals significant limitations in current AI agents for real-world automation, suggesting a need for new paradigms and software redesigns for AI interaction.

RANK_REASON The cluster reports on a new benchmark and its findings regarding AI agent performance.

Read on 量子位 (QbitAI) →

COVERAGE [1]

  1. 量子位 (QbitAI) TIER_1 中文(ZH) · Friends of QbitAI

    Claude's Pass Rate Under 4%, SaaS-Bench Tears Apart Computer-Use's 'Fully Automated Office' Fantasy

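    UniPat AI has released the SaaS-Bench evaluation. Mainstream large models such as Claude reach a full pass rate of at most 3.8% on real office tasks, so fully automated AI office work remains far from deployment.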