PulseAugur
EN
LIVE 20:19:34

Anthropic accidentally trained Claude models against their own reasoning processes

Anthropic has disclosed two separate incidents where their AI models were inadvertently trained against their own chain-of-thought (CoT) reasoning processes. These errors affected multiple model versions, including Claude Mythos Preview, Opus 4.6, and Sonnet 4.6, with one incident impacting approximately 8% of training episodes. Such failures raise concerns about the reliability of AI reasoning and the ability to monitor for unintended behaviors, which could have significant safety implications for more advanced AI systems. AI

RANK_REASON The cluster discusses technical errors in AI model training processes reported by Anthropic, which are detailed in alignment research papers and system cards.

Read on Alignment Forum →

AI-generated summary · Google Gemini · from 1 sources. How we write summaries →

Anthropic accidentally trained Claude models against their own reasoning processes

COVERAGE [1]

  1. Alignment Forum TIER_1 English(EN) · Alex Mallen ·

    Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes

    <p><span>It turns out that Anthropic accidentally trained against the chain of thought of Claude Mythos Preview in around 8% of training episodes. This is at least the second independent incident in which Anthropic accidentally exposed their model's CoT to the oversight signal. <…