A new benchmark from Microsoft Research, DELEGATE-52, shows that current large language models significantly corrupt documents during delegated workflows. Even advanced models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 degraded approximately 25% of document content over extended editing tasks. Agentic tools made the problem worse, adding a further 6% corruption. The findings point to a widespread trust and reliability problem for AI-assisted document editing across professional domains.
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT Current LLMs introduce significant errors into documents during delegated tasks, undermining both trust in them and their readiness for enterprise adoption.
RANK_REASON The cluster reports on a new benchmark and its findings regarding LLM performance in document editing tasks.