PulseAugur

New RL algorithm fix boosts GSM8K accuracy by 45 points

Researchers have identified a failure mode in the Group Relative Policy Optimization (GRPO) algorithm under binary rewards, which they call "gradient starvation." It occurs when every response in a group is correct or every response is incorrect: each reward then equals the group mean, so the group-mean-centered advantage is zero for all responses and the group contributes no learning signal. The study argues that this degeneracy is more severe than previously recognized and reports that a simple fix, the fixed-reference Sign advantage, substantially improves performance. On the GSM8K dataset, the fix raised accuracy by 45.4 points over standard GRPO.
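The zero-signal case can be illustrated with a few lines of Python. This is only a minimal sketch, not the paper's implementation: the summary does not define the fixed-reference Sign advantage, so `sign_advantages` and its `reference=0.5` default are assumptions made here for illustration (the sign of the reward relative to a constant that does not depend on the group). The usual GRPO standard-deviation normalization is omitted because it does not change the all-zero outcome.

```python
def group_mean_centered(rewards):
    """Advantage = reward minus the mean reward of its own group."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]


def sign_advantages(rewards, reference=0.5):
    """ASSUMED form of a fixed-reference sign advantage (hypothetical helper).

    The reference is a constant, so it does not move with the group mean.
    """
    return [1.0 if r > reference else -1.0 if r < reference else 0.0
            for r in rewards]


all_correct = [1, 1, 1, 1]
all_wrong = [0, 0, 0, 0]
mixed = [1, 0, 0, 1]

# Uniform groups: mean-centering cancels every reward, so no signal is left.
print(group_mean_centered(all_correct))  # [0.0, 0.0, 0.0, 0.0]
print(group_mean_centered(all_wrong))    # [0.0, 0.0, 0.0, 0.0]

# Mixed group: mean-centering still separates correct from incorrect.
print(group_mean_centered(mixed))        # [0.5, -0.5, -0.5, 0.5]

# Under the assumed sign form, uniform groups keep a nonzero advantage.
print(sign_advantages(all_correct))      # [1.0, 1.0, 1.0, 1.0]
print(sign_advantages(all_wrong))        # [-1.0, -1.0, -1.0, -1.0]
```

In this sketch the only difference between the two functions is what the reward is compared against: a quantity computed from the group itself, or a constant.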

Impact: Could improve reinforcement learning from verifiable rewards for models trained on binary rewards, potentially enhancing performance on tasks like code generation.

Ranking rationale: The cluster contains an academic paper detailing a novel algorithm fix and benchmark results.

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 1 source.


Sources [1]

  1. arXiv cs.LG · Tier 1 · English (EN) · Jyh-Shing Roger Jang

    Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works

    Group Relative Policy Optimization (GRPO) is a standard algorithm for reinforcement learning from verifiable rewards, but its group-mean-centered advantage can fail under binary rewards. The failure mode is gradient starvation: when every response in a group is correct or every r…