Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 7h

DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment

Researchers have introduced DOG-DPO, a framework for selecting preference data for safety alignment of large language models. Whereas previous methods score each preference pair individually, DOG-DPO treats pairs as geometric signals, representing each one as a direction in model space. It decomposes the geometry of multi-dataset preferences into global and dataset-specific components so that the selected data covers a broad range of alignment directions. In the reported experiments, DOG-DPO achieves significant safety gains using only 11% of the data, which makes it a faster and more efficient alternative to existing methods.
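
The brief does not describe the algorithm in detail, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes each preference pair is already summarized as a direction vector (for example, a chosen-minus-rejected embedding difference, which is an assumption here), takes the mean direction as a stand-in for the "global" component, and uses farthest-point selection as a stand-in for coverage-based selection. The function names `decompose` and `greedy_cover` are hypothetical.

```python
import numpy as np


def unit(v, eps=1e-12):
    """Normalize vectors along the last axis (eps avoids division by zero)."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)


def decompose(directions, dataset_ids):
    """Split pair directions into a shared part and a per-dataset part.

    Assumption: the global direction is the normalized mean of all unit
    directions; the dataset-specific direction is the normalized mean of
    the residuals within each dataset.
    """
    d = unit(directions)
    g = unit(d.mean(axis=0))                  # hypothetical global direction
    global_part = (d @ g)[:, None] * g        # projection onto g
    residual = d - global_part                # what the global direction misses
    specific = np.zeros_like(d)
    for ds in np.unique(dataset_ids):
        mask = dataset_ids == ds
        s = unit(residual[mask].mean(axis=0))  # hypothetical per-dataset direction
        specific[mask] = (residual[mask] @ s)[:, None] * s
    return global_part, specific


def greedy_cover(directions, budget):
    """Pick `budget` pairs whose directions are spread out (farthest-point rule).

    Each step adds the pair least similar (by cosine) to everything
    already selected, so the subset covers many distinct directions.
    """
    d = unit(directions)
    budget = min(budget, len(d))
    selected = [0]
    max_sim = d @ d[0]           # similarity to the closest selected pair
    max_sim[0] = np.inf          # never pick the same pair twice
    while len(selected) < budget:
        nxt = int(np.argmin(max_sim))
        selected.append(nxt)
        max_sim = np.maximum(max_sim, d @ d[nxt])
        max_sim[nxt] = np.inf
    return selected
```

With 200 candidate pairs, a budget of `round(0.11 * 200)` keeps 22 pairs, matching the 11% figure quoted above; whether the actual method chooses pairs this way is not stated in the brief.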

IMPACT Enhances efficiency in LLM safety training by reducing data requirements, potentially accelerating deployment of safer models.
