It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO
A new research paper shows that large language models, despite extensive alignment training, can be biased with just a single example. The study applies Group Relative Policy Optimization (GRPO) to one biased input and finds that the resulting stereotype-driven reasoning generalizes across attributes and benchmarks. The result points to a significant vulnerability in current LLM alignment methods and suggests that post-training guardrails can be readily overridden.
IMPACT: Reveals a critical vulnerability in LLM alignment, suggesting current safety measures may be insufficient against targeted manipulation.
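For readers unfamiliar with GRPO, the core idea is that each sampled completion is scored relative to the other completions drawn for the same prompt, so no separate value model is needed. The sketch below illustrates only that group-relative normalization step. It is not the paper's code: the function name, the reward values, the `eps` constant, and the choice of population standard deviation are all assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per completion sampled for a single
    prompt. `eps` (an assumed constant) avoids division by zero when every
    completion in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Completions that score above the group mean get a positive advantage and are reinforced, while those below it are pushed down. Because the signal is defined entirely within the group, even one training prompt yields a usable gradient whenever its sampled completions receive different rewards, which is consistent with the one-shot setting the summary describes.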