Brief · PulseAugur

TOOL · arXiv cs.LG English(EN) · 5h

When Autoregressive Consistency Hurts Safety Alignment

Researchers have identified autoregressive consistency as a key factor in the fragility of safety alignment in large language models. This phenomenon, where next-token prediction reinforces existing response trajectories, can lead to alignment updates being concentrated on early tokens. The paper proposes that this mechanism explains shallow safety alignment and can be exploited by attacks that introduce harmful continuations at arbitrary points. To address this, the authors introduce adversarial safety alignment, a framework designed to break harmful autoregressive consistency throughout the output trajectory. AI

IMPACT Identifies a core mechanism that can undermine LLM safety, potentially leading to new alignment techniques and attack vectors.

Large Language Models
Autoregressive Consistency