New RL methods tackle LLM training issues

By the PulseAugur editorial team · [3 sources] · 2026-05-20 05:20

Two new arXiv papers propose ways to improve reinforcement-learning training of large language models (LLMs). One addresses "advantage collapse" in Group Relative Policy Optimization (GRPO), introducing a diagnostic metric and an adaptive extension called AVSPO. The other proposes Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that uses group-level statistics to adjust the clipping range and decoding temperature during training; its authors report that it outperforms existing methods on several benchmarks.
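To make the "advantage collapse" idea concrete, here is a minimal sketch in plain Python. It assumes the common GRPO formulation in which each sampled response's reward is standardized against the mean and standard deviation of its own group, and it assumes that collapse refers to groups whose rewards are all identical, so every advantage is zero and the group contributes no learning signal. The `collapsed_fraction` helper is a hypothetical illustration only; it is not the diagnostic metric from the paper, and nothing here represents AVSPO.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Assumption: population standard deviation plus a small eps;
    real implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def collapsed_fraction(groups, tol=1e-8):
    """Hypothetical helper: share of groups whose rewards are (nearly) all equal."""
    collapsed = sum(1 for g in groups if pstdev(g) <= tol)
    return collapsed / len(groups)


# Binary verifiable rewards for three prompts, four samples each.
mixed = [1.0, 0.0, 0.0, 1.0]      # some correct, some wrong: useful signal
all_wrong = [0.0, 0.0, 0.0, 0.0]  # prompt too hard: every advantage is zero
all_right = [1.0, 1.0, 1.0, 1.0]  # prompt too easy: every advantage is zero

for name, group in [("mixed", mixed), ("all_wrong", all_wrong), ("all_right", all_right)]:
    print(name, group_relative_advantages(group))

print("collapsed fraction:", collapsed_fraction([mixed, all_wrong, all_right]))
```

In this toy batch, two of the three groups yield all-zero advantages, which is the situation the sketch takes "collapse" to mean.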

Impact: These techniques aim to improve LLM reasoning and training stability, which could lead to more robust and accurate models.

Ranking rationale: Two academic papers published on arXiv introduce new algorithms for improving LLM training.


AI-generated summary · Google Gemini · based on 3 sources.

Sources [3]

arXiv cs.AI TIER_1 · Miaobo Hu, Shuhao Hu, Bokun Wang, Ruohan Wang, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao · 2026-05-22 04:00

AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

arXiv:2605.20722v1 · Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free …
arXiv cs.LG TIER_1 · Qingyong Hu · 2026-05-20 12:57

Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

Group Relative Policy Optimization (GRPO), a prominent algorithm within the Reinforcement Learning from Verifiable Rewards (RLVR) framework, has achieved strong results in improving the reasoning capabilities of large language models (LLMs). However, GRPO is prone to advantage co…
arXiv cs.AI TIER_1 · Jun Xiao · 2026-05-20 05:20

AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that uses group-level statistic…
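
For context on the "fixed clipping and decoding temperature" that the AGPO abstract refers to, the sketch below shows where those two hyperparameters typically enter: the clip range of a PPO-style clipped surrogate objective and the temperature applied to logits before sampling. The defaults `clip_eps=0.2` and `temperature=1.0` are illustrative assumptions, not values from the paper, and the rule AGPO uses to adapt them is not given in the excerpt, so it is not shown.

```python
import math


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one sample, with a fixed clip range.

    ratio is pi_new(action) / pi_old(action); clip_eps stays constant
    for the whole run in the fixed-clipping setup.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into sampling probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# A large policy update on a good sample is capped by the clip range.
print(clipped_surrogate(ratio=1.5, advantage=1.0))

# The same logits sampled at two temperatures.
logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, temperature=0.5))
print(softmax_with_temperature(logits, temperature=2.0))
```

Because both values are constants here, they must be tuned by hand per model and task, which is the brittleness the abstract describes.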
