PulseAugur

Microsoft Research: LLMs corrupt 25% of document content in delegated tasks

DELEGATE-52, a new benchmark from Microsoft Research, shows that current large language models corrupt a substantial share of document content during delegated workflows. Even advanced models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupted approximately 25% of document content over 20 interactions. Agentic tools made the problem worse, adding another 6% degradation, which points to a broad trust and reliability problem for AI-assisted document editing across professional domains.

Impact: Current LLMs introduce significant errors into documents during delegated tasks, undermining trust in them and their readiness for enterprise adoption.

Ranking rationale: The cluster reports on a new benchmark and its findings on LLM performance in document editing tasks.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

  1. Mastodon — mastodon.social TIER_1 English(EN) · [email protected] ·

    LLMs Corrupt Your Documents When You Delegate. Philippe Laban, Tobias Schnabel, Jennifer Neville (#Microsoft Research). Large Language Models (#LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). D…

  2. Mastodon — mastodon.social TIER_1 English(EN) · AIntelligenceHub ·

    A new Microsoft Research benchmark called DELEGATE-52 found something enterprise teams need to know: even the best models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupted 25% of document content over 20 interactions. Agentic tools added another 6% degradation. Only Python cod…