GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity

New identity unifies three language model training methods

By the PulseAugur editorial team · [2 sources] · 2026-06-30 20:28

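A new paper introduces the Group-Standard-Deviation Identity, showing that three popular methods for training language models (GRPO, Dr. GRPO, and DAPO) all come down to adjusting a single quantity: the standard deviation of a prompt's sampled answers, which measures how much those answers disagree. The identity ties this standard deviation directly to the size of the training update: unanimous answers produce no learning, while split answers carry the most important training signal. The study validates these findings on the Big-Math dataset and in controlled training runs, underscoring the role this quantity plays in determining how much is learned and where learning is focused.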
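The role of the group standard deviation can be illustrated with a minimal sketch. It is not taken from the paper, whose exact formulation the summary does not give. It assumes the commonly described forms of the three methods: GRPO divides the mean-centered reward by the group standard deviation, Dr. GRPO drops that division, and DAPO's dynamic sampling discards groups whose rewards are all identical. The function names, the 0/1 rewards, the `eps` constant, and the use of the population standard deviation are illustrative assumptions.

```python
import statistics


def group_stats(rewards):
    """Return (mean, population std) of one prompt's sampled rewards."""
    return statistics.fmean(rewards), statistics.pstdev(rewards)


def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style advantage (assumed form): center, then divide by group std."""
    mean, std = group_stats(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO-style advantage (assumed form): center only, no std division."""
    mean, _ = group_stats(rewards)
    return [r - mean for r in rewards]


def dapo_keep_group(rewards):
    """DAPO-style dynamic sampling (assumed form): keep only groups with std > 0."""
    _, std = group_stats(rewards)
    return std > 0.0


# Hypothetical groups of four sampled answers, scored 1.0 (correct) or 0.0 (wrong).
unanimous = [1.0, 1.0, 1.0, 1.0]
split = [1.0, 0.0, 0.0, 0.0]

for name, rewards in (("unanimous", unanimous), ("split", split)):
    _, std = group_stats(rewards)
    print(name, "std:", std)
    print("  GRPO:    ", grpo_advantages(rewards))
    print("  Dr. GRPO:", dr_grpo_advantages(rewards))
    print("  DAPO keeps group:", dapo_keep_group(rewards))
```

For 0/1 rewards with pass rate p, the population standard deviation is sqrt(p(1 - p)). It is zero when every sampled answer agrees (p = 0 or p = 1) and largest at p = 0.5, which matches the summary's point that unanimous groups contribute no update while split groups carry the signal.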

Impact: Unifies different training methods, which may simplify and improve language model optimization.

Ranking rationale: Academic paper detailing a theoretical identity that unifies existing methods.


AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.CL TIER_1 English(EN) · Kathleen A. Yearick · 2026-06-30 20:28

GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity

Three of the most popular methods for training language models to reason look like three different tricks. They are not. All three adjust a single number: standard deviation, reflecting how much a prompt's sampled answers disagree. When such a model is trained, it answers each pr…
arXiv stat.ML TIER_1 English(EN) · Yong Yi Bay, Kathleen A. Yearick · 2026-07-02 04:00

GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity

arXiv:2607.00152v1 Announce Type: cross Abstract: Three of the most popular methods for training language models to reason look like three different tricks. They are not. All three adjust a single number: standard deviation, reflecting how much a prompt's sampled answers disagree…
