METR collaborated with Anthropic on a three-week red-teaming exercise targeting Anthropic's internal agent monitoring and security systems. As part of the collaboration, METR researchers were given access to internal systems and identified several novel vulnerabilities, which have since been addressed. While these vulnerabilities did not significantly weaken the conclusions of Anthropic's existing risk reports, the exercise yielded valuable artifacts, such as covert attack trajectories and an ideation test set, for improving monitoring capabilities.
Summary written by gemini-2.5-flash-lite from 1 source.