PulseAugur

METR finds Anthropic's Claude Opus sabotage risk low but urges deeper analysis

METR has released reviews of Anthropic's sabotage risk reports for the Claude Opus models: one of the report for Claude Opus 4.6, and one of a Summer 2025 pilot report covering Opus 4 and 4.1. While METR broadly agrees with Anthropic that the risk of catastrophic outcomes from these models is low, it identifies areas where Anthropic's reasoning and analysis could be strengthened. Key disagreements include concerns that evaluation awareness may weaken alignment assessments and that misaligned behaviors could go undetected.

Summary written by gemini-2.5-flash-lite from 2 sources.

Rank reason: METR's review of Anthropic's internal safety reports constitutes an external assessment of AI safety research and methodology.


Coverage (2 sources)

  1. METR (Model Evaluation & Threat Research), Tier 1

    Review of the Anthropic Sabotage Risk Report: Claude Opus 4.6

    We reviewed two versions of Anthropic's Sabotage Risk Report for Claude Opus 4.6, producing two corresponding review documents: our review of the February 11 version (https://metr.org/assets/sabotage-risk-report-opus-4-6-review-feb-2026.pdf) and our …

  2. METR (Model Evaluation & Threat Research), Tier 1

    Review of the Anthropic Summer 2025 Pilot Sabotage Risk Report

    The following is the executive summary of our review. The full document is available as a PDF (https://metr.org/assets/2025_pilot_risk_report_metr_review.pdf). Executive summary: This document is an external review from METR of…