sGPO: Trading Inference FLOPs for Training Efficiency in RLVR
Researchers have developed a training strategy called sorted Group Policy Optimization (sGPO) to improve the efficiency of Reinforcement Learning with Verifiable Rewards (RLVR). The method spends a small amount of inference compute to estimate how difficult each query is, then uses that estimate to allocate training resources. By profiling queries and adapting the training group size, sGPO reduces wasted computation and can cut total training compute by up to a factor of three while maintaining or improving performance.
AI IMPACT: Reduces training compute for RLVR, potentially accelerating research and development in areas that rely on verifiable rewards.
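
The idea of profiling queries and then adapting the group size can be illustrated with a minimal Python sketch. This is not the sGPO algorithm itself, whose details the summary does not give. It assumes a group-relative setup in which a group whose rollouts all receive the same reward carries no learning signal, and it picks the smallest group size likely to contain both successes and failures. The names `estimate_pass_rate`, `choose_group_size`, `rollout_fn`, and all default values are hypothetical.

```python
def estimate_pass_rate(query, rollout_fn, n_probe=4):
    """Profile a query with a few cheap rollouts (hypothetical helper).

    rollout_fn(query) is assumed to return a verifiable reward of 0.0 or 1.0.
    """
    rewards = [rollout_fn(query) for _ in range(n_probe)]
    return sum(rewards) / n_probe


def mixed_group_probability(p, n):
    """Probability that n independent rollouts are not all-pass or all-fail."""
    return 1.0 - p**n - (1.0 - p) ** n


def choose_group_size(p, min_size=2, max_size=16, target=0.9):
    """Pick the smallest group size expected to yield mixed rewards.

    Assumption for illustration: queries whose estimated pass rate is
    0 or 1 are skipped (size 0), since a uniform group gives no signal.
    """
    if p <= 0.0 or p >= 1.0:
        return 0
    for n in range(min_size, max_size + 1):
        if mixed_group_probability(p, n) >= target:
            return n
    return max_size  # very easy or very hard queries get the cap


if __name__ == "__main__":
    # Toy verifier: the "query" is just its own true pass/fail outcome.
    always_pass = lambda q: 1.0
    p = estimate_pass_rate("toy query", always_pass)
    print(p, choose_group_size(p))   # profiled as trivially easy, skipped
    print(choose_group_size(0.5))    # medium difficulty needs a small group
    print(choose_group_size(0.05))   # hard query is assigned the cap
```

In this sketch, medium-difficulty queries get small groups because a few rollouts already produce mixed rewards, while queries near the extremes either get the maximum size or are skipped. A real system would need a more robust difficulty estimate than four probe rollouts.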