Future-KL Regularized GRPO: Process-Level Credit Assignment from $f$-Divergence Regularization
Researchers have developed Future-KL Regularized Policy Optimization (FRPO), a method for Large Language Model (LLM) post-training that does not require a critic model. FRPO targets a limitation of Group Relative Policy Optimization (GRPO): local per-token KL penalties miss the part of the autoregressive KL regularizer that depends on future tokens. FRPO adds a causal future-KL correction that accounts for this missing term, which sharpens the policy-gradient signal at the level of individual generation steps. On mathematical reasoning tasks, the method improves pass@16 while maintaining higher entropy and lower policy drift than existing methods.
AI IMPACT: Introduces a critic-free approach to LLM fine-tuning that could reduce computational cost and improve performance on reasoning tasks.
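To make the idea of a causal future-KL term concrete, the following is a minimal sketch and not the paper's estimator. It assumes the per-token log-ratio log π(y_t) − log π_ref(y_t) on sampled tokens is used as a single-sample KL estimate, that the "future" term at position t is the sum of those log-ratios over strictly later tokens, and that both terms are subtracted from a GRPO-style group-standardized advantage with one coefficient. The function names and the `beta` parameter are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def future_kl_to_go(log_ratios):
    """For each position t, sum the log-ratios of all strictly later tokens.

    Assumption: this suffix sum stands in for the future part of the
    autoregressive KL that a purely local token penalty does not see.
    """
    out = [0.0] * len(log_ratios)
    running = 0.0
    for t in range(len(log_ratios) - 1, -1, -1):
        out[t] = running          # only tokens after t have been accumulated
        running += log_ratios[t]
    return out


def shaped_token_signals(advantage, log_ratios, beta=0.1):
    """Hypothetical per-token signal: advantage minus beta * (local + future KL)."""
    future = future_kl_to_go(log_ratios)
    return [
        advantage - beta * (local + fut)
        for local, fut in zip(log_ratios, future)
    ]


if __name__ == "__main__":
    advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
    log_ratios = [0.5, 0.25, 0.125]  # made-up values for one sampled response
    print(future_kl_to_go(log_ratios))
    print(shaped_token_signals(advantages[0], log_ratios, beta=0.5))
```

In this sketch, earlier tokens receive a larger correction because more of the response lies ahead of them, which is one way a sequence-level reward could be turned into a position-dependent signal without a learned critic. Whether FRPO uses this exact form, a different divergence, or a different weighting is not stated in the summary above.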