Anthropic has released the Cyber Jailbreak Severity (CJS) scale, a new framework for classifying and rating AI jailbreaks. The scale rates jailbreaks from CJS-0 to CJS-4 based on factors such as capability gain, breadth of attack types enabled, ease of weaponization, and discoverability. The company also details its updated cyber classifiers, which sort requests into four categories: prohibited, high-risk dual-use, low-risk dual-use, and benign. High-risk dual-use actions are currently blocked until authorization controls are improved. Anthropic is seeking community feedback on the CJS scale, along with reports of potential cyber jailbreaks, through a HackerOne program.
AI IMPACT Establishes a standardized language for AI jailbreak risks, potentially influencing safety protocols and regulatory discussions across the industry.
RANK_REASON Research milestone publication by an AI lab detailing a new safety framework.
- An Ape and a Fox
- Anthropic
- CJS-0
- CJS-4
- Claude Fable 5
- Cyber Jailbreak Severity (CJS) scale
- HackerOne
AI-generated summary · Google Gemini · from 1 source.