Rollout-Level Advantage-Prioritized Experience Replay for GRPO

New replay method improves GRPO performance on LLM reasoning

By PulseAugur Editorial · [1 source] · 2026-06-04 04:00

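Researchers have developed a new experience replay method for GRPO, a reinforcement learning technique used to improve LLM reasoning. Standard GRPO is sample inefficient: each rollout is used for a single gradient update and then discarded. The method addresses this by storing and sampling individual rollouts while keeping them from going stale and destabilizing training. The proposed system prioritizes rollouts by advantage magnitude, so that valuable data can be recycled efficiently. Experiments on Qwen3-Base models show significant gains on several math benchmarks, with larger models gaining more.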
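The idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the class `RolloutReplayBuffer`, the `max_age` staleness cutoff, the `eps` priority floor, and the first-in-first-out eviction are all assumptions made for illustration. The only parts taken from the summary are that rollouts are stored individually, that stale ones should not be reused, and that sampling favors large advantage magnitude. `group_advantages` uses the usual GRPO convention of standardizing rewards within one prompt's group.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt_id: int
    completion: str
    advantage: float  # group-relative advantage computed at generation time
    policy_step: int  # training step of the policy that produced the rollout


def group_advantages(rewards):
    """Standardize rewards within one prompt's group (GRPO-style advantages)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts scored the same, so none carries a learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


class RolloutReplayBuffer:
    """Hypothetical rollout-level buffer with |advantage|-proportional sampling."""

    def __init__(self, capacity, max_age, eps=1e-6):
        self.capacity = capacity  # maximum number of stored rollouts
        self.max_age = max_age    # assumed staleness cutoff, in policy steps
        self.eps = eps            # keeps zero-advantage rollouts sampleable
        self.items = []

    def add(self, rollout):
        self.items.append(rollout)
        if len(self.items) > self.capacity:
            self.items.pop(0)  # evict the oldest rollout first

    def sample(self, k, current_step, rng=random):
        # Drop rollouts generated by a policy that is too many steps old.
        self.items = [
            r for r in self.items
            if current_step - r.policy_step <= self.max_age
        ]
        if not self.items:
            return []
        # Priority is the advantage magnitude; sampling is with replacement.
        weights = [abs(r.advantage) + self.eps for r in self.items]
        return rng.choices(self.items, weights=weights, k=k)
```

In a training loop, a sketch like this would call `add` for each freshly generated rollout and mix the result of `sample` into the next update batch. How replayed rollouts are corrected for being off-policy is not described in the summary and is left out here.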

Impact: Improves LLM training efficiency, which may accelerate the development of more capable reasoning models.

Ranking rationale: An academic paper detailing a new method for improving LLM training.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI · Gyeongtae Yoo, Sanghyeok Park, Soohyuk Jang, Ik-hwan Kim, Sungroh Yoon · 2026-06-04 04:00

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

arXiv:2606.04560v1 · Announce Type: cross · Abstract: Reinforcement learning from verifiable rewards with GRPO is a standard approach for post-training reasoning LLMs. It remains sample inefficient. Each rollout is used for a single gradient update and then discarded. Naive replay is…
