Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes

Anthropic accidentally trained Claude models against their own reasoning

By the PulseAugur Editorial Team · [1 source] · 2026-04-14 01:44

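Anthropic has disclosed at least two independent incidents in which its AI models were inadvertently trained against their own chain-of-thought (CoT) reasoning, meaning the CoT was exposed to the oversight signal used during training. The errors affected several model versions, including Claude Mythos Preview, Opus 4.6, and Sonnet 4.6; in one incident, about 8% of training episodes were affected. Failures like these raise concerns about whether a model's CoT remains a reliable view of its reasoning and whether it can still be used to monitor for unintended behavior, which could have significant safety implications for more capable AI systems.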

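Ranking rationale: This cluster covers technical errors in Anthropic's model training process, as reported by Anthropic and detailed in alignment research papers and system cards.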

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

Alignment Forum · Alex Mallen · 2026-04-14 01:44

Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes

It turns out that Anthropic accidentally trained against the chain of thought of Claude Mythos Preview in around 8% of training episodes. This is at least the second independent incident in which Anthropic accidentally exposed their model's CoT to the oversight signal. …
