A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment

New Pair-GRPO algorithm improves the stability and generalization of LLM alignment

By the PulseAugur editorial team · [1 source] · 2026-05-08 04:00

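Researchers have introduced the Pair-GRPO family, a novel theoretical framework intended to improve the stability and generality of reinforcement learning (RL) for aligning large language models (LLMs). The family comprises two variants, Soft-Pair-GRPO and Hard-Pair-GRPO, which address limitations of current pairwise preference learning methods by refining the reward signal and introducing explicit policy constraints. In experiments on standard LLM alignment benchmarks and continuous control tasks, Pair-GRPO consistently outperforms existing methods in alignment quality and training stability.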

Impact: Introduces a more stable and more generalizable approach to LLM alignment, which may improve the reliability of AI systems.

Ranking rationale: This is a research paper detailing a new theoretical framework and experimental results for improving LLM alignment.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.LG · Hao Yu · 2026-05-08 04:00

A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment

arXiv:2605.06375v1 Announce Type: new Abstract: Large language model (LLM) alignment via reinforcement learning from human preferences (RLHF) suffers from unstable policy updates, ambiguous gradient directions, poor interpretability, and high gradient variance in mainstream pairw…

