PulseAugur
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

New Benchmark Reveals AI Agents Struggle with Professional Workflows

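A new benchmark called Workflow-GYM has been introduced to evaluate AI agents on complex, long-horizon tasks in professional software environments. Current AI agents show significant limitations on these real-world workflows: even the most advanced models achieve success rates only slightly above 30%. The research highlights inconsistent workflow execution, error propagation, and insufficient understanding of professional software, indicating that agent capabilities need substantial improvement.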

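Impact: Highlights the significant limitations of current AI agents on professional tasks and points to directions for future research on agentic AI.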

Ranking rationale: The cluster contains a research paper introducing a new benchmark for evaluating AI agents.


AI-generated summary · Google Gemini · based on 3 sources.

Sources [3]

  1. arXiv cs.AI TIER_1 English(EN) · Liya Zhu, Jingzhe Ding, Jian Zhang, Jianbo Xue, Shihao Liang, Ge Zhang, Xiang Gao, Qingshui Gu, Mailun Gao, Huimin Che, Yan Zhao, Peiheng Zhou, Haojun Wang, Chaobo Xian, Lili Le, Chi Wu, Yiwei Liu, Shengda Long, Jiale Yang, Fangzhi Xu, Sijin Wu, Haodong … ·

    Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

    arXiv:2606.11042v1. Abstract: Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-…

  2. arXiv cs.AI TIER_1 English(EN) · Xiaolong Chang ·

    Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

    Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows acros…

  3. Hugging Face Daily Papers TIER_1 English(EN) ·

    Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

    Current AI agents struggle with long-horizon professional GUI workflows, achieving low success rates due to issues with workflow consistency and domain-specific software understanding.