Researchers have developed a new benchmark to test how well Large Language Models (LLMs) can automate professional productivity software such as Word, Excel, and PowerPoint. The evaluation, based on China's National Computer Rank Examination, comprises 200 practical tasks and over 7,000 machine-gradable criteria. Frontier LLMs struggled: single-turn models scored at most 36.6%, and even advanced agentic systems reached only 68.8%, well short of the 95.5% reference score.
AI IMPACT Highlights significant limitations of current LLMs on real-world office automation tasks, indicating a gap between general reasoning and precise software interaction.
RANK_REASON Academic paper introducing a new benchmark for LLM capabilities in office software automation.
AI-generated summary · Google Gemini · from 1 source.