Researchers have identified a critical failure mode of the Group Relative Policy Optimization (GRPO) algorithm under binary rewards, termed "gradient starvation": when every response in a group is correct, or every response is incorrect, the group-normalized advantage vanishes and the update carries zero learning signal. The study proves this degeneracy is more severe than previously recognized and shows that a simple fix, the fixed-reference Sign advantage, substantially improves performance: on the GSM8K dataset it boosts accuracy by 45.4 points over standard GRPO.
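The degeneracy is easy to see in code. The sketch below, a minimal illustration rather than the paper's implementation, computes standard GRPO group-normalized advantages and a hypothetical fixed-reference sign advantage (here assuming a fixed reference of 0.5 for 0/1 rewards; the paper's exact formulation may differ). When all rewards in the group are equal, the GRPO advantages collapse to zero while the sign advantages do not:

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Standard GRPO: normalize each reward by the group mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def sign_advantages(rewards, reference=0.5):
    """Hypothetical fixed-reference sign advantage: compare each reward
    to a fixed reference instead of the group's own statistics."""
    return [1.0 if r > reference else -1.0 for r in rewards]

# Degenerate group: every sampled response is correct.
all_correct = [1, 1, 1, 1]
print(grpo_advantages(all_correct))  # ~[0.0, 0.0, 0.0, 0.0]: no signal
print(sign_advantages(all_correct))  # [1.0, 1.0, 1.0, 1.0]: signal kept

# Mixed group: both methods produce a nonzero signal.
mixed = [1, 0, 1, 0]
print(grpo_advantages(mixed))
print(sign_advantages(mixed))
```

Because the sign advantage is anchored to a fixed reference rather than the group's own mean, all-correct and all-incorrect groups still contribute gradient, which is the intuition behind the reported fix.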
IMPACT Improves reinforcement learning for models trained on binary rewards, potentially enhancing performance on verifiable-reward tasks such as math problem solving and code generation.
RANK_REASON The cluster contains an academic paper detailing a novel algorithm fix and benchmark results.