Brief · PulseAugur

TOOL · arXiv cs.LG English(EN) · 3w

CompassDPO: Dynamics-Controlled Direct Preference Optimization for Robust Safety Alignment

Researchers have introduced CompassDPO, a framework for making the safety alignment of language models more robust. It targets the sensitivity of Direct Preference Optimization (DPO) to imperfect supervision by controlling the optimization dynamics. CompassDPO uses the implicit reward margin as a guide to regulate each sample's influence on both the direction and the magnitude of the update, without requiring an external reward model or additional data.
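For readers unfamiliar with the term, the sketch below shows how the implicit reward margin is usually defined in standard DPO and how it already scales each sample's gradient. It is a minimal illustration of plain DPO only: the brief does not specify CompassDPO's control rule, so none is implemented here. The function names, the example log-probabilities, and the value of `beta` are assumptions made for illustration.

```python
import math


def implicit_reward_margin(logp_chosen, logp_rejected,
                           ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO implicit reward margin for one preference pair.

    Each implicit reward is beta * (policy log-prob - reference log-prob);
    the margin is the chosen reward minus the rejected reward.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return chosen_reward - rejected_reward


def dpo_loss(margin):
    """Per-sample DPO loss, -log(sigmoid(margin)), computed stably."""
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def dpo_gradient_scale(margin):
    """sigmoid(-margin): the factor that scales a sample's gradient in plain DPO.

    Pairs the policy currently ranks wrongly (negative margin) get a scale
    near 1, so a mislabeled pair can dominate the update.
    """
    if margin >= 0:
        e = math.exp(-margin)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(margin))


# Hypothetical sequence log-probabilities for one (chosen, rejected) pair.
margin = implicit_reward_margin(-10.0, -12.0, -11.0, -11.0, beta=0.1)
print(margin, dpo_loss(margin), dpo_gradient_scale(margin))
```

In plain DPO the gradient scale depends only on the margin, which is why a noisy label with a strongly negative margin pulls hard on the model. The brief describes CompassDPO as using this same margin signal to regulate that influence.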

IMPACT The framework could make safety alignment less sensitive to noisy or imperfect preference data, yielding more reliable alignment techniques.

Direct Preference Optimization
Yonghui Yang
PKU-SafeRLHF
CompassDPO