Researchers have identified a critical failure mode of the Group Relative Policy Optimization (GRPO) algorithm under binary rewards, termed "gradient starvation": when every response in a group is correct, or every response is incorrect, the group-normalized advantage vanishes and the update carries zero learning signal. The study proves this degeneracy is more severe than previously recognized and shows that a simple fix, the fixed-reference Sign advantage, substantially improves performance: on the GSM8K dataset it boosts accuracy by 45.4 points over standard GRPO.
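The degeneracy is easy to see in code. The sketch below, a minimal illustration rather than the paper's implementation, computes standard GRPO group-normalized advantages and a hypothetical fixed-reference sign advantage (here assuming a fixed reference of 0.5 for 0/1 rewards; the paper's exact formulation may differ). When all rewards in the group are equal, the GRPO advantages collapse to zero while the sign advantages do not:

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Standard GRPO: normalize each reward by the group mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def sign_advantages(rewards, reference=0.5):
    """Hypothetical fixed-reference sign advantage: compare each reward
    to a fixed reference instead of the group's own statistics."""
    return [1.0 if r > reference else -1.0 for r in rewards]

# Degenerate group: every sampled response is correct.
all_correct = [1, 1, 1, 1]
print(grpo_advantages(all_correct))  # ~[0.0, 0.0, 0.0, 0.0]: no signal
print(sign_advantages(all_correct))  # [1.0, 1.0, 1.0, 1.0]: signal kept

# Mixed group: both methods produce a nonzero signal.
mixed = [1, 0, 1, 0]
print(grpo_advantages(mixed))
print(sign_advantages(mixed))
```

Because the sign advantage is anchored to a fixed reference rather than the group's own mean, all-correct and all-incorrect groups still contribute gradient, which is the intuition behind the reported fix.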
IMPACT Improves reinforcement learning for models trained on binary rewards, potentially enhancing performance on verifiable-reward tasks such as math problem solving and code generation.
RANK_REASON The cluster contains an academic paper detailing a novel algorithm fix and benchmark results.