AI research: SFT overtraining causes rank inversion in code generation models

By PulseAugur Editorial · [2 sources] · 2026-06-16 20:59

A new research paper examines how overtraining during supervised fine-tuning (SFT) affects subsequent reinforcement learning with verifiable rewards (RLVR) for code generation models. The study, which focuses on Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B, finds that SFT can compress the rollout distribution, leading to rank inversion, in which the checkpoints that look most promising before RL perform poorly after GRPO training. The researchers propose a two-stage diagnostic that monitors entropy before RL and during early RL to identify and stop failing runs, and note that standard regularization techniques did not resolve the issue.

IMPACT Identifies a failure mode in RLVR training for code generation models, which could improve training efficiency and reliability.

RANK_REASON The cluster contains a research paper published on arXiv detailing findings on AI model training.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.AI TIER_1 English(EN) · Siddharth Aphale, Kelly Liu · 2026-06-18 04:00

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

arXiv:2606.18487v1 · The standard heuristic of selecting the SFT checkpoint with the highest pass@1 for GRPO can fail when SFT compresses the rollout distribution. For binary rewards, the expected within group advantage variance is $p(1{-}p)(g{-}1)/g$…

arXiv cs.CL TIER_1 English(EN) · Kelly Liu · 2026-06-16 20:59

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

The standard heuristic of selecting the SFT checkpoint with the highest pass@1 for GRPO can fail when SFT compresses the rollout distribution. For binary rewards, the expected within group advantage variance is $p(1{-}p)(g{-}1)/g$; when early GRPO drives $p$ below $p^*(g)$, most …
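
The variance expression quoted in the abstract can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes each of the $g$ rollouts in a group receives an independent binary reward that equals 1 with probability $p$, that the advantage is the reward minus the group mean (no standard-deviation normalization), and that the variance is averaged over the $g$ rollouts. The function names are hypothetical.

```python
import random


def expected_group_variance(p: float, g: int) -> float:
    """Closed form quoted in the abstract: p(1-p)(g-1)/g."""
    return p * (1.0 - p) * (g - 1) / g


def simulated_group_variance(p: float, g: int, n_groups: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the mean within-group advantage variance.

    Assumption (not from the paper): advantage = reward - group mean,
    and the variance divides by g rather than g - 1.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_groups):
        # Draw g independent binary rewards with success probability p.
        rewards = [1.0 if rng.random() < p else 0.0 for _ in range(g)]
        mean = sum(rewards) / g
        # Mean squared advantage within this group.
        total += sum((r - mean) ** 2 for r in rewards) / g
    return total / n_groups


if __name__ == "__main__":
    g = 8  # hypothetical group size, chosen only for illustration
    for p in (0.05, 0.25, 0.5, 0.9):
        closed = expected_group_variance(p, g)
        approx = simulated_group_variance(p, g)
        print(f"p={p:.2f}  closed form={closed:.4f}  simulated={approx:.4f}")
```

Under these assumptions the closed form peaks at $p = 0.5$ and falls to zero as $p$ approaches 0 or 1, so groups in which nearly every rollout fails (or nearly every rollout passes) contribute almost no advantage signal.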

