PulseAugur
research · [2 sources] ·

Microsoft Research: LLMs corrupt 25% of documents in delegated tasks

DELEGATE-52, a new benchmark from Microsoft Research, finds that current large language models corrupt documents during delegated workflows. Even frontier models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupted about 25% of document content over 20 interactions. Agentic tools made the problem worse, adding a further 6% degradation. The results point to a widespread trust and reliability problem for AI-assisted document editing across professional domains.
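For a sense of scale, the coverage below quotes the 25% figure as measured over 20 interactions. The sketch below converts that cumulative figure into an implied per-interaction loss. It assumes, purely for illustration, that each interaction corrupts the same fraction of the still-intact content (geometric compounding); the benchmark is not known to behave this way, and the function names `per_step_loss` and `cumulative_loss` are hypothetical.

```python
def per_step_loss(total_loss: float, steps: int) -> float:
    """Constant per-step loss that compounds to `total_loss` after `steps` steps.

    Assumption (not from the benchmark): every step corrupts the same
    fraction of whatever content is still intact.
    """
    if not 0.0 <= total_loss < 1.0:
        raise ValueError("total_loss must be in [0, 1)")
    if steps <= 0:
        raise ValueError("steps must be positive")
    retained = 1.0 - total_loss
    return 1.0 - retained ** (1.0 / steps)


def cumulative_loss(step_loss: float, steps: int) -> float:
    """Fraction of content corrupted after `steps` steps at a constant per-step loss."""
    return 1.0 - (1.0 - step_loss) ** steps


if __name__ == "__main__":
    # Figures quoted in the coverage: 25% corruption over 20 interactions.
    rate = per_step_loss(0.25, 20)
    print(f"implied per-interaction loss: {rate:.2%}")
```

Under this assumption the implied loss is roughly 1.4% per interaction, which illustrates how a small error rate per edit can accumulate into a large share of a document over a long session.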

Summary written by gemini-2.5-flash-lite from 2 sources.

IMPACT Current LLMs introduce significant errors into documents during delegated tasks, which undermines trust and suggests they are not yet ready for enterprise adoption.

RANK_REASON The cluster reports on a new benchmark and its findings regarding LLM performance in document editing tasks.

COVERAGE [2]

  1. Mastodon — mastodon.social TIER_1 · [email protected] ·

    LLMs Corrupt Your Documents When You Delegate. Philippe Laban, Tobias Schnabel, Jennifer Neville (#Microsoft Research). Large Language Models (#LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). D…

  2. Mastodon — mastodon.social TIER_1 · AIntelligenceHub ·

    A new Microsoft Research benchmark called DELEGATE-52 found something enterprise teams need to know: even the best models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupted 25% of document content over 20 interactions. Agentic tools added another 6% degradation. Only Python cod…