S-SPPO: Semantic-Calibrated Self-Play Preference Optimization
Researchers have introduced S-SPPO, a framework for aligning large language models with human preferences. It targets instabilities in earlier Self-Play Preference Optimization (SPPO) techniques by adding semantic calibration in two forms. Supervision calibration adjusts win-rate targets according to the semantic overlap between responses, and representation calibration maintains diversity in model outputs. The method comes with a theoretical guarantee of convergence to a Nash equilibrium. Empirically, S-SPPO achieved a higher win rate on the AlpacaEval 2.0 benchmark with Llama-3-8B, without requiring additional human-annotated preferences.
AI IMPACT: Introduces a novel method to improve LLM alignment, potentially leading to more reliable and human-consistent AI behavior.
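
To make the idea of supervision calibration concrete, here is a minimal Python sketch. It is not the authors' formulation, which this summary does not give. The squared-error objective follows the general shape of the original SPPO loss, while the calibration rule (shrinking the win-rate target toward 0.5 as semantic overlap grows), the use of cosine similarity as the overlap measure, and all function names are assumptions made purely for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def calibrated_win_rate(win_rate, overlap):
    """Hypothetical calibration rule (an assumption, not taken from S-SPPO).

    Pulls the estimated win rate toward 0.5 in proportion to the semantic
    overlap, so near-identical responses yield a near-neutral target.
    """
    overlap = min(max(overlap, 0.0), 1.0)  # clamp overlap to [0, 1]
    return 0.5 + (1.0 - overlap) * (win_rate - 0.5)


def sppo_squared_loss(logp_policy, logp_ref, win_rate, eta=1.0):
    """SPPO-style squared loss for a single response.

    Penalizes the gap between the policy/reference log-ratio and the
    scaled, centered win-rate target eta * (win_rate - 0.5).
    """
    log_ratio = logp_policy - logp_ref
    return (log_ratio - eta * (win_rate - 0.5)) ** 2


if __name__ == "__main__":
    # Toy embeddings standing in for two sampled responses (made-up values).
    emb_a = [1.0, 0.0, 1.0]
    emb_b = [1.0, 0.0, 0.5]
    overlap = cosine_similarity(emb_a, emb_b)
    target = calibrated_win_rate(0.9, overlap)
    print(sppo_squared_loss(-12.0, -12.3, target, eta=1.0))
```

Under this assumed rule, two responses that say nearly the same thing produce a target close to 0.5, so a noisy preference signal between them moves the policy very little.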