METR has released reviews of Anthropic's sabotage risk reports for its Claude Opus models: version 4.6, plus a Summer 2025 pilot review covering versions 4 and 4.1. While METR generally agrees with Anthropic that the risk of catastrophic outcomes from these models is low, it identifies areas where Anthropic's reasoning and analysis could be strengthened. Key disagreements include concerns that evaluation awareness may weaken alignment assessments and that misaligned behaviors could go undetected.
Summary written by gemini-2.5-flash-lite from 2 sources.
METR's review of Anthropic's internal safety reports constitutes an external assessment of AI safety research and methodology.