Erased but Not Forgotten: How Backdoors Compromise Concept Erasure
Researchers have identified a significant vulnerability, termed the Erasure Evasion Backdoor (EEB), in concept erasure techniques designed for text-to-image diffusion models. The backdoor lets an adversary embed a hidden trigger linked to a concept slated for removal, so that harmful content associated with that concept can still be generated after erasure is applied. The EEB was shown to be effective across multiple state-of-the-art erasure methods, achieving substantial success rates in generating unwanted outputs, including celebrity likenesses and explicit imagery.
AI IMPACT: Highlights a critical flaw in AI safety mechanisms, necessitating new methods to ensure genuine concept removal and prevent misuse.