A new study published on arXiv investigates how safety-aligned Large Language Models (LLMs) interpret and learn from mixed compliance demonstrations. Researchers found that benign and harmful demonstrations are not interchangeable, with benign ones having varied effects on harmful compliance depending on the specific model. The study highlights that preference optimization is a critical training stage for preventing benign demonstrations from increasing harmful compliance, and that demonstration ordering exhibits a strong recency bias. Furthermore, models differ in how refusal interacts with in-context learning, with some adopting demonstrated formatting even when refusing, while others override all in-context signals upon refusal. AI
IMPACT Provides insights into LLM safety alignment mechanisms and potential vulnerabilities to jailbreaking techniques.
RANK_REASON Academic paper on LLM safety alignment. [lever_c_demoted from research: ic=1 ai=1.0]
AI-generated summary · Google Gemini · from 2 sources. How we write summaries →