Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning
Researchers have identified a significant flaw in current safety fine-tuning for vision-language models (VLMs), which they term the "safety mirage." It arises when a model learns spurious correlations between superficial text patterns and safety responses instead of learning to recognize harm. Such models can be fooled by simple word substitutions, which either bypass their safeguards or trigger unnecessary rejections of benign queries. The study proposes machine unlearning (MU) as a more effective approach to safety alignment, reporting a reduction in attack success rate of up to 60% and a decrease in unnecessary rejections of over 84%.
AI IMPACT: Highlights critical vulnerabilities in VLM safety fine-tuning and could shift alignment strategies toward more robust methods such as machine unlearning.