DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment
Researchers have developed DOG-DPO, a new framework for selecting preference data to improve safety alignment in large language models. Unlike previous methods that score pairs individually, DOG-DPO treats preference pairs as geometric signals, representing them as directions in model space. This approach decomposes the geometry of multi-dataset preferences into global and dataset-specific components to ensure broad coverage of alignment directions. Experiments show DOG-DPO can achieve significant safety gains using only 11% of the data, offering a faster and more efficient alternative to existing methods.
IMPACT: Enhances efficiency in LLM safety training by reducing data requirements, potentially accelerating deployment of safer models.
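
To make the geometric idea in the summary concrete, here is a minimal Python sketch. It is not DOG-DPO's actual algorithm, which the summary does not specify. It assumes each preference pair has already been turned into a direction vector (for example, the difference between embeddings of the chosen and rejected responses). It approximates the "global" component with the top singular directions of the pooled data, and it swaps in a greedy pivoted Gram-Schmidt pass as a stand-in for coverage-based selection. The function names `unit_rows`, `split_global_specific`, and `greedy_cover`, the value `k_global=4`, and the synthetic data are all hypothetical.

```python
import numpy as np


def unit_rows(x):
    """Scale each row to unit length so that only its direction matters."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def split_global_specific(directions, k_global):
    """Split directions into a shared rank-k part and a residual part.

    Assumption: the "global" component is the projection onto the top-k
    right singular vectors of the pooled directions. The residual is
    treated as the dataset-specific component.
    """
    _, _, vt = np.linalg.svd(directions, full_matrices=False)
    basis = vt[:k_global]  # (k, d) with orthonormal rows
    global_part = (directions @ basis.T) @ basis
    return global_part, directions - global_part


def greedy_cover(directions, budget):
    """Greedily pick up to `budget` rows that span the directions.

    Each step takes the row with the largest norm not yet explained by
    earlier picks, then removes that direction from every row.
    """
    residual = np.array(directions, dtype=float)
    picked = np.zeros(len(residual), dtype=bool)
    chosen = []
    for _ in range(budget):
        norms = np.linalg.norm(residual, axis=1)
        norms[picked] = -1.0  # never pick the same row twice
        i = int(np.argmax(norms))
        if norms[i] <= 1e-12:  # every direction is already covered
            break
        q = residual[i] / norms[i]
        residual -= np.outer(residual @ q, q)
        picked[i] = True
        chosen.append(i)
    return chosen


# Synthetic stand-in for two preference datasets (hypothetical data).
rng = np.random.default_rng(0)
dataset_a = rng.normal(size=(60, 16))
dataset_b = rng.normal(size=(40, 16))
directions = unit_rows(np.vstack([dataset_a, dataset_b]))

global_part, specific_part = split_global_specific(directions, k_global=4)
budget = round(0.11 * len(directions))  # 11 of 100 pairs
subset = greedy_cover(directions, budget)
```

Selecting on residual norm rather than on a per-pair score is what separates this from individual scoring: a pair is picked only if it adds a direction the earlier picks do not already cover. One could also run `greedy_cover` separately on `global_part` and on each dataset's rows of `specific_part`, though whether that matches the paper's procedure is an open assumption.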