DELEGATE-52, a new benchmark from Microsoft Research, shows that current large language models significantly corrupt documents during delegated workflows. Even advanced models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 degraded approximately 25% of document content over extended editing tasks. Agentic tools made the problem worse, adding a further 6% corruption. The results point to a widespread trust and reliability problem in AI-assisted document editing across professional domains.
Impact: Current LLMs introduce significant errors into documents during delegated tasks, undermining trust and readiness for enterprise adoption.
Ranking rationale: The cluster reports on a new benchmark and its findings on LLM performance in document editing tasks.
AI-generated summary · Google Gemini · from 2 sources.