PulseAugur

Large Language Models Struggle to Automate Office Software on a New Benchmark

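Researchers have developed a new benchmark that evaluates how well large language models (LLMs) can automate professional productivity software such as Word, Excel, and PowerPoint. The evaluation is based on China's National Computer Rank Examination and comprises 200 real-world tasks with more than 7,000 machine-gradable criteria. Frontier LLMs struggled: the best single-turn model scored only 36.6%, and even advanced agentic systems reached just 68.8%, well below the reference score of 95.5%.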

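Impact: Highlights significant limitations of current LLMs on real-world office automation tasks, pointing to a gap between general reasoning and precise software interaction.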

Ranking rationale: Academic paper introducing a new benchmark for LLM office software automation capabilities.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

  1. arXiv cs.AI · Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia, Yuzhong Zhao, Yupan Huang, Wenshan Wu, Xiangyang Zhou, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Furu Wei

    Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

    arXiv:2606.10956v1. Abstract: The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an …

  2. arXiv cs.CL · Furu Wei

    Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

    The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-auto…