Rollout-Level Advantage-Prioritized Experience Replay for GRPO
Researchers have developed an experience replay method for GRPO (Group Relative Policy Optimization), a reinforcement learning technique used to improve LLM reasoning. Standard GRPO is sample-inefficient; the method addresses this by storing and sampling individual rollouts in a way that keeps them from becoming stale and destabilizing training. The replay system prioritizes rollouts by the magnitude of their advantage, so the most valuable data can be recycled efficiently. In experiments on Qwen3-Base models, the method yielded significant gains across multiple math benchmarks, with larger models improving more.
AI IMPACT: Enhances LLM training efficiency, potentially leading to faster development of more capable reasoning models.
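
The idea of a rollout-level buffer prioritized by advantage magnitude can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `AdvantagePrioritizedBuffer`, the parameters `capacity`, `alpha`, and `max_age`, the age-based eviction rule, and the proportional sampling scheme are all assumptions made for illustration. Only the group-standardized advantage follows the usual GRPO formulation.

```python
import random
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    prompt: str
    response: str
    advantage: float  # group-relative advantage at generation time
    step: int         # training step at which the rollout was generated


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards standardized within one prompt's group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class AdvantagePrioritizedBuffer:
    """Hypothetical replay buffer that stores individual rollouts."""

    def __init__(self, capacity=1024, alpha=1.0, max_age=4, seed=0):
        self.capacity = capacity  # assumed maximum number of stored rollouts
        self.alpha = alpha        # assumed exponent: 0 = uniform, larger = greedier
        self.max_age = max_age    # assumed staleness limit, in training steps
        self.items = []
        self.rng = random.Random(seed)

    def add(self, rollout):
        self.items.append(rollout)
        if len(self.items) > self.capacity:
            # Over capacity: drop the rollout with the smallest |advantage|.
            self.items.remove(min(self.items, key=lambda r: abs(r.advantage)))

    def evict_stale(self, current_step):
        # Discard rollouts generated more than max_age steps ago.
        self.items = [r for r in self.items
                      if current_step - r.step <= self.max_age]

    def sample(self, k, current_step):
        """Draw k rollouts (with replacement), weighted by |advantage| ** alpha."""
        self.evict_stale(current_step)
        if not self.items:
            return []
        # The small constant keeps the weights valid if every advantage is zero.
        weights = [abs(r.advantage) ** self.alpha + 1e-8 for r in self.items]
        return self.rng.choices(self.items, weights=weights, k=k)
```

In a training loop, freshly generated groups would be scored with `group_advantages`, added rollout by rollout, and mixed with a batch drawn from `sample`. Dropping old rollouts by age is only one simple way to limit staleness; the summarized work may handle it differently.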