TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
Researchers have introduced Token-level Bregman Preference Optimization (TBPO), a method for aligning language models using pairwise preferences. Unlike existing approaches that operate on full sequences, TBPO optimizes at the token level, which matches how models generate text: one token at a time. The method, which includes variants such as TBPO-Q and TBPO-A, aims to improve training stability and output diversity across various benchmarks.
AI IMPACT Introduces a more principled approach to aligning language models, potentially improving their performance and training stability across tasks.