PulseAugur

AI research: SFT overtraining causes rank inversion in code generation models

A new research paper examines how overtraining during supervised fine-tuning (SFT) affects subsequent reinforcement learning with verifiable rewards (RLVR) in code generation models. The study, which focuses on Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B, finds that SFT can compress the rollout distribution, leading to rank inversion: a checkpoint that looks most promising before RL, such as the one with the highest pass@1, can perform poorly after GRPO. The researchers propose a two-stage diagnostic that monitors entropy before RL and early in RL to identify and stop failing runs, and they note that standard regularization techniques did not resolve the issue.
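The entropy monitoring mentioned above can be illustrated with a minimal Python sketch. It is not the paper's diagnostic: the summary gives neither the entropy definition nor any thresholds, so the sketch assumes mean Shannon entropy over next-token distributions, and the `floor` and `patience` parameters and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(q * math.log(q) for q in probs if q > 0.0)


def mean_rollout_entropy(step_distributions: Sequence[Sequence[float]]) -> float:
    """Average per-token entropy over the sampled positions of a rollout."""
    entropies = [token_entropy(d) for d in step_distributions]
    return sum(entropies) / len(entropies)


def should_stop(entropy_history: List[float], floor: float, patience: int) -> bool:
    """Hypothetical stop rule: True once the last `patience` entropy
    readings are all below `floor` (both values are assumptions)."""
    if len(entropy_history) < patience:
        return False
    return all(h < floor for h in entropy_history[-patience:])
```

In use, one reading would be taken on the SFT checkpoint before RL and further readings during the first RL steps, with a run flagged once `should_stop` returns True. Suitable values for `floor` and `patience` would have to come from the paper or from calibration.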

IMPACT Identifies a critical failure mode in RLVR for code generation; detecting it early could improve model training efficiency and reliability.

RANK_REASON The cluster contains a research paper published on arXiv detailing findings on AI model training.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Siddharth Aphale, Kelly Liu

    SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

    arXiv:2606.18487v1. The standard heuristic of selecting the SFT checkpoint with the highest pass@1 for GRPO can fail when SFT compresses the rollout distribution. For binary rewards, the expected within-group advantage variance is $p(1{-}p)(g{-}1)/g$…

  2. arXiv cs.CL TIER_1 English(EN) · Kelly Liu

    SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

    The standard heuristic of selecting the SFT checkpoint with the highest pass@1 for GRPO can fail when SFT compresses the rollout distribution. For binary rewards, the expected within-group advantage variance is $p(1{-}p)(g{-}1)/g$; when early GRPO drives $p$ below $p^*(g)$, most …
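
The variance expression quoted in the abstracts above can be checked numerically. The following is a minimal Python sketch, not code from the paper. It assumes i.i.d. binary rewards with success probability p in a group of g rollouts, and it takes the advantage to be the reward minus the group mean, with no standard-deviation normalization. The function names are illustrative. The threshold $p^*(g)$ is not defined in the excerpt, so the sketch does not try to compute it.

```python
from math import comb


def expected_advantage_variance(p: float, g: int) -> float:
    """Closed form quoted in the abstract: p(1-p)(g-1)/g."""
    return p * (1.0 - p) * (g - 1) / g


def enumerated_advantage_variance(p: float, g: int) -> float:
    """Same expectation, computed by summing over k successes in a group.

    With k successes out of g binary rewards, the group mean is k/g and the
    within-group variance of (reward - mean) is (k/g) * (1 - k/g).
    """
    total = 0.0
    for k in range(g + 1):
        prob_k = comb(g, k) * p**k * (1.0 - p) ** (g - k)  # Binomial(g, p) pmf
        mean = k / g
        total += prob_k * mean * (1.0 - mean)
    return total


if __name__ == "__main__":
    g = 8  # assumed group size, for illustration only
    for p in (0.05, 0.25, 0.5, 0.75, 0.95):
        closed = expected_advantage_variance(p, g)
        summed = enumerated_advantage_variance(p, g)
        print(f"p={p:.2f}  closed={closed:.5f}  enumerated={summed:.5f}")
```

With g = 8, the closed form gives 0.21875 at p = 0.5 and shrinks toward zero as p approaches 0 or 1, which is the regime in which group-relative advantages carry little learning signal.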