Brief · PulseAugur

TOOL · arXiv cs.AI · English (EN) · 6h

Can We Stop Malicious AI? KILLBENCH: A Benchmark for External AI Kill Switch Feasibility

Researchers have introduced KILLBENCH, a benchmark for evaluating whether external AI kill switches are feasible. It focuses on widely deployed web agents and tests methods for halting malicious AI behavior without access to the model's internal parameters. KILLBENCH comprises four malicious AI agent configurations, eight harmful scenarios, and prompts derived from ten jailbreak patterns, with the aim of assessing whether external kill switches could work against advanced models such as Claude "Mythos". The study also evaluates four external kill switch defense methods across several AI models, including Grok-4.3, GPT-5.2, and Gemma4.

IMPACT Provides an evaluation framework for external AI kill switches, which helps in understanding and mitigating risks from increasingly capable AI agents.

GPT-5.2
Claude "Mythos"
OpenClaw
Grok-4.3
Gemma4
Qwen3.6
KILLBENCH
Qwen3.5-uncensored
Sechan Lee