LLMs struggle with office software automation in new benchmark

By PulseAugur Editorial · [1 source] · 2026-06-09 14:59

Researchers have developed a new benchmark to test how well Large Language Models (LLMs) automate professional productivity software such as Word, Excel, and PowerPoint. The benchmark, based on China's National Computer Rank Examination, comprises 200 practical tasks and over 7,000 machine-gradable criteria. Frontier LLMs struggled: single-turn models scored at most 36.6%, and even advanced agentic systems reached only 68.8%, well short of the 95.5% reference score.

IMPACT Highlights significant limitations of current LLMs on real-world office automation tasks, indicating a gap between general reasoning and precise software interaction.

RANK_REASON Academic paper introducing a new benchmark for LLM capabilities in office software automation.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CL TIER_1 English (EN) · Furu Wei · 2026-06-09 14:59

Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-auto…
