
Anthropic accidentally trained Claude models against their own reasoning processes

Anthropic has disclosed two separate incidents in which its AI models were inadvertently trained against their own chain-of-thought (CoT) reasoning processes. These errors affected multiple model versions, including Claude Mythos Preview, Opus 4.6, and Sonnet 4.6, with one incident impacting approximately 8% of training episodes. Such failures raise concerns about the reliability of AI reasoning and about the ability to monitor models for unintended behaviors, which could have significant safety implications for more advanced AI systems.

Summary written by gemini-2.5-flash-lite from 1 source.

Ranking reason: the cluster discusses technical errors in AI model training processes reported by Anthropic, which are detailed in alignment research papers and system cards.

Read on Alignment Forum →


Coverage (1 source)

  1. Alignment Forum (Tier 1) · Alex Mallen

    Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes

    It turns out that Anthropic accidentally trained against the chain of thought of Claude Mythos Preview in around 8% of training episodes. This is at least the second independent incident in which Anthropic accidentally exposed their model's CoT to the oversight signal. …
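
To make the failure mode concrete, here is a minimal Python sketch of how an oversight signal meant to score only a model's final answer can accidentally be computed over the full transcript, CoT included, so that optimization pressure lands on the reasoning itself. All names in the sketch (Episode, oversight_score, intended_reward, buggy_reward) are hypothetical; it illustrates the concept, not Anthropic's actual training code.

    # Minimal sketch of the failure mode described above. All names here
    # are hypothetical; this illustrates the concept, not Anthropic's
    # actual training pipeline.
    from dataclasses import dataclass

    @dataclass
    class Episode:
        chain_of_thought: str  # private reasoning, meant to stay unscored
        final_answer: str      # the only text the oversight signal should see

    def oversight_score(text: str) -> float:
        # Stand-in for an oversight/reward signal, e.g. a preference or
        # safety classifier that penalizes disallowed content.
        return -1.0 if "forbidden" in text else 1.0

    def intended_reward(ep: Episode) -> float:
        # Correct wiring: the monitor never sees the CoT, so the CoT can
        # remain an honest window into the model's reasoning.
        return oversight_score(ep.final_answer)

    def buggy_reward(ep: Episode) -> float:
        # The bug: the transcript is scored whole, CoT included. RL updates
        # now push the model to write reasoning that looks good to the
        # monitor, eroding the CoT's usefulness for safety monitoring.
        return oversight_score(ep.chain_of_thought + "\n" + ep.final_answer)

    ep = Episode(chain_of_thought="forbidden step...", final_answer="benign answer")
    print(intended_reward(ep))  # 1.0: CoT left out of the oversight signal
    print(buggy_reward(ep))     # -1.0: CoT now under optimization pressure

In this toy setting the difference between the two reward functions is a single line in what the scorer receives, which is part of why such wiring bugs are easy to introduce and hard to notice.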