Researchers have identified a critical issue in the Group Relative Policy Optimization (GRPO) algorithm when applied to binary rewards, leading to "gradient starvation." This occurs when all responses in a group are either correct or incorrect, resulting in zero learning signal. The study proves this degeneracy is worse than previously thought and demonstrates that a simple fix, the fixed-reference Sign advantage, significantly improves performance. On the GSM8K dataset, this fix boosted accuracy by 45.4 points compared to the standard GRPO method. AI
IMPACT Improves reinforcement learning from human feedback (RLHF) for models trained on binary rewards, potentially enhancing performance on tasks like code generation.
RANK_REASON The cluster contains an academic paper detailing a novel algorithm fix and benchmark results. [lever_c_demoted from research: ic=1 ai=1.0]
AI-generated summary · Google Gemini · from 1 sources. How we write summaries →