Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?
A new research paper introduces an evaluation framework for testing how proficiently Large Language Model (LLM) agents use standard office software such as Word, Excel, and PowerPoint. The study found that even advanced LLMs struggle with complex document automation tasks: single-turn models scored below 37 and more sophisticated agentic systems reached only 68.8 on a 100-point scale. This highlights a significant gap in current LLM capabilities for fine-grained office automation.
IMPACT: Highlights significant limitations of LLM agents in practical office automation tasks, indicating a need for further development of agentic capabilities and reasoning.