TPMM-DPO: Trajectory-aware Preference-guided Model Merging for Iterative Direct Preference Optimization
Researchers have introduced TPMM-DPO, a method for aligning large language models that targets error accumulation in iterative Direct Preference Optimization (DPO). The approach treats the sequence of policy models as an optimization trajectory and adaptively merges them with learned weights to form a more stable and robust reference model. According to the reported experiments, TPMM-DPO improves generation quality and outperforms standard iterative DPO by mitigating degradation in later training stages.
IMPACT: Improves LLM alignment stability and performance by mitigating error accumulation in iterative training.
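
To make the merging idea concrete, here is a minimal sketch of combining a trajectory of policy checkpoints into one reference model by a weighted average of their parameters. The summary does not say how TPMM-DPO learns or parameterizes its weights, so everything here is an assumption for illustration: the softmax parameterization, the per-checkpoint `logits`, the helper names `softmax` and `merge_checkpoints`, and the toy parameter dictionaries are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Map unconstrained scores to positive weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_checkpoints(checkpoints, logits):
    """Weighted average of parameter dicts along a training trajectory.

    checkpoints: list of {param_name: list of floats}, one per iteration.
    logits: one score per checkpoint (assumed here to be learned elsewhere).
    """
    if len(checkpoints) != len(logits):
        raise ValueError("need exactly one logit per checkpoint")
    weights = softmax(logits)
    merged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(size)
        ]
    return merged


# Toy trajectory: three policy checkpoints from successive iterations.
trajectory = [
    {"layer.weight": [0.0, 1.0]},
    {"layer.weight": [2.0, 3.0]},
    {"layer.weight": [4.0, 5.0]},
]

# Equal logits give a uniform average of the three checkpoints.
reference = merge_checkpoints(trajectory, logits=[0.0, 0.0, 0.0])

# A large logit on the last checkpoint makes the merge track it closely.
latest_heavy = merge_checkpoints(trajectory, logits=[0.0, 0.0, 50.0])
```

In this sketch the merged parameters would serve as the reference model for the next DPO iteration. Constraining the weights to a simplex keeps the merged model inside the convex hull of the checkpoints, which is one simple way to avoid leaning entirely on a single, possibly degraded, iterate.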