A new research paper examines supervised fine-tuning (SFT) overtraining in reinforcement learning from human feedback (RLHF) pipelines for code generation models. The study, which focuses on Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B, found that SFT can compress the reward distribution, leading to rank inversion: checkpoints that look promising before RLHF perform poorly after it. The researchers propose a two-stage diagnostic that monitors entropy before RL and early in RL to identify and stop failing runs, and they note that standard regularization techniques did not resolve the issue.
IMPACT Identifies a critical failure mode in RLHF for code generation, potentially improving model training efficiency and reliability.
RANK_REASON The cluster contains a research paper published on arXiv detailing findings on AI model training.
- DeepSeek-Coder-6.7B
- Qwen2.5-Coder-3B
- rank inversion
- RLVR
- Siddharth Aphale
- supervised fine-tuning
- GRPO
AI-generated summary · Google Gemini · from 2 sources.