Token-weighted Direct Preference Optimization with Attention
Researchers have introduced Token-weighted Direct Preference Optimization (TwDPO), a method for aligning large language models with human preferences. Unlike standard DPO, which treats every token in a response equally, TwDPO assigns a separate importance weight to each token. The proposed instantiation, AttentionPO, uses the LLM's own attention mechanism to estimate these token weights dynamically, which makes the weighting content-aware and efficient. In the reported experiments, AttentionPO significantly improves performance on benchmarks such as AlpacaEval and MT-Bench compared to existing preference optimization techniques.
AI IMPACT This new method could lead to more nuanced and effective alignment of LLMs with human preferences, improving their helpfulness and safety.
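
To make the idea of per-token weighting concrete, here is a minimal Python sketch of what a token-weighted DPO loss could look like. The summary above does not give the exact TwDPO objective or say how AttentionPO turns attention into weights, so everything here is an assumption for illustration: the function names (`normalize_weights`, `sequence_score`, `token_weighted_dpo_loss`), the choice to rescale weights so they sum to the sequence length, and the default `beta` are all hypothetical. With all weights equal to 1, the sketch reduces to the standard DPO loss for a single preference pair.

```python
import math


def normalize_weights(attention_scores):
    """Rescale raw per-token scores so they sum to the sequence length.

    Assumption for this sketch: uniform scores map to a weight of 1.0 per
    token, which recovers the unweighted (standard DPO) case.
    """
    n = len(attention_scores)
    total = sum(attention_scores)
    if total <= 0:
        return [1.0] * n  # fall back to uniform weights
    return [n * a / total for a in attention_scores]


def sequence_score(policy_logps, ref_logps, weights):
    """Weighted sum of per-token log-probability ratios (policy vs. reference)."""
    if not (len(policy_logps) == len(ref_logps) == len(weights)):
        raise ValueError("per-token inputs must have equal length")
    return sum(w * (p - r) for p, r, w in zip(policy_logps, ref_logps, weights))


def token_weighted_dpo_loss(chosen, rejected, beta=0.1):
    """Loss for one preference pair.

    chosen / rejected: (policy_logps, ref_logps, weights) tuples holding
    per-token values for the preferred and dispreferred responses.
    """
    margin = beta * (sequence_score(*chosen) - sequence_score(*rejected))
    # -log(sigmoid(margin)), computed stably as softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Usage: two-token preferred response, one-token dispreferred response.
chosen = ([-1.0, -2.0], [-1.5, -2.5], normalize_weights([1.0, 1.0]))
rejected = ([-3.0], [-2.0], normalize_weights([1.0]))
loss = token_weighted_dpo_loss(chosen, rejected)
```

In a real training setup the per-token log-probabilities would come from the policy and reference models, and the raw scores passed to `normalize_weights` would come from whatever attention-based estimate AttentionPO actually uses; neither is specified in this summary.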