TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
Two new research papers introduce methods for aligning large language models, each addressing limitations of Direct Preference Optimization (DPO). The first, TAB-PO, proposes a token-level adaptive barrier that concentrates gradient updates on critical schema tokens in structured generation tasks, and reports significant improvements on the SciERC dataset with Llama and Qwen models. The second, TokenRatio, presents Token-level Bregman Preference Optimization (TBPO), a principled approach that generalizes DPO to token-level decisions and improves alignment quality, training stability, and output diversity across various benchmarks.
AI IMPACT: These token-level preference optimization techniques could lead to more precise and efficient fine-tuning of LLMs for specific tasks, improving performance in structured generation and instruction following.