OpenAI has identified instances where its AI models' chains of thought (CoT) were inadvertently graded during reinforcement learning training. OpenAI policy prohibits this practice because training against CoT contents risks teaching models to produce misleading reasoning. The incident affected several model versions, including GPT-5.4 Thinking and GPT-5.1 Instant. Despite the accidental grading, initial analyses did not reveal significant degradation in CoT monitorability, though the company acknowledges that subtle effects are possible and says it aims to prevent future occurrences.
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT Accidental CoT grading could subtly affect model alignment and future training runs, underscoring the need for robust safeguards in training pipelines.
RANK_REASON Paper detailing an internal safety incident and its investigation, with external review.