New RL algorithm fix boosts GSM8K accuracy by 45 points

By PulseAugur Editorial · [1 source] · 2026-05-08 12:58

Researchers have identified a failure mode, "gradient starvation," in the Group Relative Policy Optimization (GRPO) algorithm when it is trained with binary rewards. When every response in a group is correct, or every response is incorrect, the group-mean-centered advantage is zero for all of them and the group contributes no learning signal. The study proves this degeneracy is worse than previously thought and shows that a simple fix, the fixed-reference Sign advantage, substantially improves performance. On the GSM8K dataset, the fix raised accuracy by 45.4 points over the standard GRPO method.
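The starvation effect can be reproduced in a few lines. The sketch below is illustrative only and is not the paper's code: `group_mean_advantage` follows the group-mean centering described above (any further normalization a full GRPO implementation may apply is omitted), while `fixed_reference_sign_advantage` and its `reference=0.5` default are assumptions about what a "fixed-reference Sign advantage" could look like, since the summary does not give the exact definition.

```python
from typing import List


def group_mean_advantage(rewards: List[float]) -> List[float]:
    """Group-mean-centered advantage: each reward minus the group's mean reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def fixed_reference_sign_advantage(
    rewards: List[float], reference: float = 0.5
) -> List[float]:
    """ASSUMED form of a sign advantage against a fixed reference.

    Returns +1 for rewards above the reference and -1 otherwise, so the
    value does not depend on what the rest of the group scored.
    """
    return [1.0 if r > reference else -1.0 for r in rewards]


all_correct = [1.0, 1.0, 1.0, 1.0]
all_wrong = [0.0, 0.0, 0.0, 0.0]
mixed = [1.0, 0.0, 0.0, 1.0]

for name, group in [("all correct", all_correct), ("all wrong", all_wrong), ("mixed", mixed)]:
    print(name, group_mean_advantage(group), fixed_reference_sign_advantage(group))
```

For the two uniform groups the mean-centered advantage is all zeros, so those samples contribute nothing to the policy gradient. Under the assumed sign form, the same groups still yield nonzero advantages.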

IMPACT May improve reinforcement learning from verifiable rewards for models trained on binary rewards, potentially enhancing performance on tasks like code generation.

RANK_REASON The cluster contains an academic paper detailing a novel algorithm fix and benchmark results.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.LG TIER_1 English(EN) · Jyh-Shing Roger Jang · 2026-05-08 12:58

Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works

Group Relative Policy Optimization (GRPO) is a standard algorithm for reinforcement learning from verifiable rewards, but its group-mean-centered advantage can fail under binary rewards. The failure mode is gradient starvation: when every response in a group is correct or every r…
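How often a group is starved can be estimated with a short calculation. The sketch below assumes the responses in a group are independent and each is correct with the same probability `p`. That is a simplification made here for illustration, not a claim from the paper, and `starvation_probability` is a hypothetical helper name.

```python
def starvation_probability(p: float, group_size: int) -> float:
    """Probability that a group is all-correct or all-incorrect.

    Assumes independent responses, each correct with probability p.
    """
    return p ** group_size + (1.0 - p) ** group_size


# Compare a balanced prompt with very easy and very hard ones at group size 8.
for p in (0.5, 0.95, 0.05):
    print(f"p={p:.2f}  P(starved group)={starvation_probability(p, 8):.4f}")
```

Under this assumption, a prompt the model solves half the time almost never produces a uniform group of eight, while a prompt it solves 95% (or only 5%) of the time produces one in roughly two out of three groups.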

