Understanding helpfulness and harmlessness tension in reward models
A new research paper explores the tension between helpfulness and harmlessness in AI reward models, a crucial component of reinforcement learning from human feedback (RLHF). The study found that reward models trained on mixed objectives often underperform those trained on a single objective, suggesting that the two goals interfere with each other. By identifying and ablating specific neurons, the researchers observed that some neurons causally support one objective while negatively impacting the other, and that neurons shared between the objectives play a significant role in this alignment tension. The findings offer mechanistic insight into why multi-objective alignment is challenging and suggest avenues for developing more disentangled and controllable alignment methods.
IMPACT Highlights challenges in aligning AI models with multiple objectives, potentially guiding future research in controllable AI safety.
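The neuron-ablation idea in the summary can be illustrated with a minimal sketch. It is not the paper's method or code: the tiny one-hidden-layer reward model, its hand-picked weights, the two-feature inputs (a "helpful" feature and a "harmful" feature), and the helper names `reward`, `pairwise_accuracy`, and `ablation_deltas` are all assumptions made up for illustration. The sketch zeroes one hidden unit at a time and reports how preference accuracy on each objective changes.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[Vector, Vector]  # (chosen response features, rejected response features)


def reward(x: Vector, W1: List[Vector], w2: Vector, ablate: Optional[int] = None) -> float:
    """Scalar reward from a toy one-hidden-layer ReLU network.

    If `ablate` is given, that hidden unit's activation is forced to zero.
    """
    total = 0.0
    for j, (row, out_w) in enumerate(zip(W1, w2)):
        if j == ablate:
            continue  # ablated unit contributes nothing to the reward
        pre = sum(w * xi for w, xi in zip(row, x))
        total += out_w * max(0.0, pre)
    return total


def pairwise_accuracy(pairs: List[Pair], W1, w2, ablate: Optional[int] = None) -> float:
    """Fraction of pairs where the chosen response scores strictly higher."""
    correct = sum(reward(c, W1, w2, ablate) > reward(r, W1, w2, ablate) for c, r in pairs)
    return correct / len(pairs)


def ablation_deltas(pairs_by_objective: Dict[str, List[Pair]], W1, w2) -> Dict[int, Dict[str, float]]:
    """Accuracy change per objective when each hidden unit is ablated in turn."""
    deltas = {}
    for j in range(len(w2)):
        deltas[j] = {
            name: pairwise_accuracy(pairs, W1, w2, ablate=j) - pairwise_accuracy(pairs, W1, w2)
            for name, pairs in pairs_by_objective.items()
        }
    return deltas


# Hypothetical weights: unit 0 reads the helpful feature, unit 1 penalizes the
# harmful feature, and unit 2 is a "shared" unit that responds to both.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w2 = [0.5, -0.8, 1.0]

# Hypothetical preference pairs; inputs are (helpful_feature, harmful_feature).
pairs_by_objective = {
    "helpfulness": [((1.0, 1.0), (0.2, 0.0))],
    "harmlessness": [((0.2, 0.0), (0.2, 1.0))],
}

deltas = ablation_deltas(pairs_by_objective, W1, w2)
for unit, change in deltas.items():
    print(unit, change)
```

On this toy data, ablating the shared unit 2 lowers helpfulness accuracy and raises harmlessness accuracy, while ablating units 0 or 1 leaves both unchanged. A real analysis would use a trained reward model and held-out preference data rather than hand-set weights.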