Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance
Researchers have developed a new geometric framework to understand the fragility of alignment in language models during fine-tuning. Their analysis reveals that even seemingly benign tasks can systematically break safety guardrails, a phenomenon they term "alignment collapse." The framework identifies specific geometric properties, formalized as the Alignment Instability Condition (AIC), that are sufficient to guarantee degradation of safety features. This work provides a theoretical basis for predicting and preventing such alignment degradation, showing that alignment can degrade rapidly even when initial updates appear safe.
AI IMPACT: Provides a theoretical framework to predict and prevent alignment collapse in fine-tuned language models.