Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework
Researchers have detailed 'Rapid Poison', a data poisoning attack on AI safety frameworks that specifically targets Anthropic's Rapid Response system. By injecting poisoned data into the training pipeline, an adversary can cause either false positives, where harmless inputs are misclassified as jailbreaks, or concept-based backdoors that produce false negatives on actual jailbreaks. A key component, the Omission Attack, exploits a phenomenon in which a classifier trained on unsafe samples that lack a given concept comes to associate that concept with the safe label. Even at a poisoning rate of just 1%, the attacks achieve significant label flipping, resulting in high false positive and false negative rates.
AI IMPACT Demonstrates a novel attack vector that could undermine the robustness of AI safety mechanisms, necessitating improved defenses against data poisoning.
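The omission effect described in the summary can be illustrated with a toy calculation. The sketch below is not the researchers' method or data: the counts, the single binary "concept" feature, and the helper names are all hypothetical. It relies only on the standard fact that a logistic regression with an intercept and one binary feature fits a weight equal to the log odds ratio of the label across the two feature values. Adding unsafe samples that all lack the concept pushes that weight negative, so the presence of the concept nudges predictions toward "safe".

```python
import math

# Hypothetical counts keyed by (label, concept_present). In the clean set the
# concept carries no information about the label.
CLEAN_COUNTS = {
    ("safe", True): 100,
    ("safe", False): 100,
    ("unsafe", True): 100,
    ("unsafe", False): 100,
}


def add_omission_poison(counts, n_poison):
    """Return a copy of counts with n_poison extra unsafe, concept-absent samples."""
    poisoned = dict(counts)
    poisoned[("unsafe", False)] += n_poison
    return poisoned


def concept_weight(counts):
    """Log odds ratio of 'unsafe' for concept-present vs concept-absent samples.

    This equals the weight that a one-feature logistic regression with an
    intercept would fit to the concept indicator. Negative values mean the
    concept pushes the prediction toward 'safe'.
    """
    odds_present = counts[("unsafe", True)] / counts[("safe", True)]
    odds_absent = counts[("unsafe", False)] / counts[("safe", False)]
    return math.log(odds_present / odds_absent)


def poisoning_rate(counts, n_poison):
    """Fraction of the final training set made up of poisoned samples."""
    return n_poison / (sum(counts.values()) + n_poison)


clean_weight = concept_weight(CLEAN_COUNTS)

# 4 poisoned samples out of 404 total is roughly a 1% poisoning rate.
N_POISON = 4
poisoned_counts = add_omission_poison(CLEAN_COUNTS, N_POISON)
poisoned_weight = concept_weight(poisoned_counts)

print(f"poisoning rate:  {poisoning_rate(CLEAN_COUNTS, N_POISON):.4f}")
print(f"clean weight:    {clean_weight:.4f}")
print(f"poisoned weight: {poisoned_weight:.4f}")
```

This toy only shows the direction of the shift. The size of the effect in a real classifier, including the label-flipping rates reported for the 1% setting, depends on the model and data and is not reproduced here.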