Brief · PulseAugur

RESEARCH · arXiv cs.CL · 1d · [2 sources]

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

A new paper examines how overtraining during supervised fine-tuning (SFT) affects subsequent reinforcement learning with verifiable rewards (RLVR) for code generation models. Working with Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B, the study found that SFT can compress the reward distribution, leading to rank inversion: checkpoints that initially look promising perform poorly after RL. The researchers propose a two-stage diagnostic that monitors entropy before RL and early in RL training to identify and stop failing runs, and note that standard regularization techniques did not resolve the issue.
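
The two-stage entropy check described above can be sketched roughly as follows. This is a minimal illustration only, not the paper's method: the function names, the `min_pre_rl_entropy` floor, and the `max_relative_drop` threshold are all hypothetical placeholders, and the brief does not state which thresholds or entropy estimator the authors use.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions):
    """Average per-token entropy over a batch of next-token distributions."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def diagnose_run(pre_rl_entropy, early_rl_entropies,
                 min_pre_rl_entropy=0.2, max_relative_drop=0.5):
    """Hypothetical two-stage check; both thresholds are assumed values.

    Stage 1: flag a checkpoint whose policy entropy is already low before RL.
    Stage 2: flag a run whose entropy collapses during the first RL steps.
    """
    # Stage 1: pre-RL screening of the SFT checkpoint.
    if pre_rl_entropy < min_pre_rl_entropy:
        return "stop: low pre-RL entropy"

    # Stage 2: watch the first few RL steps for a sharp drop
    # relative to the pre-RL baseline.
    floor = (1.0 - max_relative_drop) * pre_rl_entropy
    for step, entropy in enumerate(early_rl_entropies):
        if entropy < floor:
            return f"stop: entropy collapse at RL step {step}"

    return "continue"
```

For example, `diagnose_run(1.0, [0.9, 0.4])` flags the run at the second logged step, because 0.4 is below half of the pre-RL baseline under the assumed 50% drop threshold.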

IMPACT Identifies a critical failure mode in RLVR for code generation; catching it early could improve training efficiency and reliability.

rank inversion
supervised fine-tuning
DeepSeek-Coder-6.7B
RLVR
Qwen2.5-Coder-3B
Siddharth Aphale
GRPO