Large language models struggle to automate office software on a new benchmark

By PulseAugur Editorial · [2 sources] · 2026-06-09 14:59

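Researchers have developed a new benchmark that evaluates how well large language models (LLMs) can automate professional productivity software such as Word, Excel, and PowerPoint. The evaluation is based on China's National Computer Rank Examination and comprises 200 real-world tasks and more than 7,000 machine-gradable criteria. Frontier LLMs struggle: the best single-turn model scores only 36.6%, and even advanced agentic systems reach just 68.8%, far below the reference score of 95.5%.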

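Impact: The results highlight significant limitations of current LLMs on practical office automation tasks and point to a gap between general reasoning and precise software interaction.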

Ranking rationale: Academic paper introducing a new benchmark for LLM capabilities in office software automation.


AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia, Yuzhong Zhao, Yupan Huang, Wenshan Wu, Xiangyang Zhou, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Furu Wei · 2026-06-10 04:00

Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

arXiv:2606.10956v1 · Abstract: The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an …

arXiv cs.CL TIER_1 English(EN) · Furu Wei · 2026-06-09 14:59

Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-auto…
