New Methods Improve On-Policy Distillation for LLMs

By the PulseAugur editorial team · [5 sources] · 2026-05-07 09:50

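Researchers have developed new methods to make on-policy distillation (OPD) of large language models more efficient and more stable. One method, vOPD, uses a control-variate baseline derived from the reverse KL divergence to reduce gradient variance without significant computational overhead. Another, ROPD, performs rubric-based distillation using only teacher-generated responses, offering a black-box-compatible alternative to logit-based OPD. A third technique, Near-Policy Distillation (NPD), speeds up the process through asynchronous generation and selective packing, achieving significant speedups and outperforming standard fine-tuning.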
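
To make the control-variate idea concrete, here is a minimal sketch, not the vOPD implementation. It assumes a toy softmax student over four tokens and a fixed teacher distribution (both hypothetical), uses the score-function estimator of the reverse-KL gradient, and takes the baseline to be the exact reverse KL at that position, which is one reading of "derived from the reverse KL divergence"; the paper's estimator may differ. The function name `estimator_moments` and all numbers are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimator_moments(student_logits, teacher_probs, use_baseline):
    """Exact mean and total variance of the single-sample estimator of
    d KL(p || q) / d logits, computed by enumerating every token."""
    p = softmax(student_logits)
    k = len(p)
    # Per-token cost log p(x) / q(x); its expectation under p is KL(p || q).
    log_ratio = [math.log(p[x] / teacher_probs[x]) for x in range(k)]
    reverse_kl = sum(p[x] * log_ratio[x] for x in range(k))
    # Assumption: the control-variate baseline is the closed-form reverse KL.
    baseline = reverse_kl if use_baseline else 0.0
    # Estimator value if token x is sampled: (cost - baseline) * grad log p(x),
    # where grad log p(x) w.r.t. the logits is onehot(x) - p for a softmax.
    samples = [
        [(log_ratio[x] - baseline) * ((1.0 if i == x else 0.0) - p[i])
         for i in range(k)]
        for x in range(k)
    ]
    mean = [sum(p[x] * samples[x][i] for x in range(k)) for i in range(k)]
    total_var = sum(
        p[x] * sum((samples[x][i] - mean[i]) ** 2 for i in range(k))
        for x in range(k)
    )
    return mean, total_var

student_logits = [0.0, 0.0, 0.0, 0.0]  # hypothetical uniform student
teacher_probs = [0.7, 0.1, 0.1, 0.1]   # hypothetical teacher distribution

mean_plain, var_plain = estimator_moments(student_logits, teacher_probs, use_baseline=False)
mean_cv, var_cv = estimator_moments(student_logits, teacher_probs, use_baseline=True)
print("total variance without baseline:", var_plain)
print("total variance with reverse-KL baseline:", var_cv)
```

Both versions have the same mean, so subtracting the baseline leaves the gradient estimate unbiased. In this toy case the total variance drops; how much a baseline helps in general depends on the student and teacher distributions.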

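Impact: These advances offer more efficient and more stable methods for aligning LLMs, which could speed their deployment on complex reasoning tasks.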

Ranking rationale: Multiple arXiv papers introduce new methods for improving on-policy distillation in LLMs.

AI-generated summary · Google Gemini · from 5 sources.

Sources [5]

arXiv cs.CL TIER_1 English(EN) · Tomas Pfister · 2026-05-11 17:40

RubricEM: Meta-RL with Rubric-Guided Policy Decomposition Beyond Verifiable Rewards

Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisi…
arXiv cs.AI TIER_1 English(EN) · Yohan Jo · 2026-05-08 15:24

KL for a KL: On-Policy Distillation with Control Variate Baseline

On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially for reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo estimator, and recipes for stabl…
arXiv cs.LG TIER_1 English(EN) · Tat-Seng Chua · 2026-05-08 07:52

Rubric-Based On-Policy Distillation

On-policy distillation (OPD) is a powerful paradigm for model alignment, yet its reliance on teacher logits restricts its application to white-box scenarios. We contend that structured semantic rubrics can serve as a scalable alternative to teacher logits, enabling OPD using only…
arXiv cs.LG TIER_1 English(EN) · Miao Rang, Zhenni Bi, Hang Zhou, Kai Han, Xuechun Wang, An Xiao, Xinghao Chen, Yunhe Wang, Hanting Chen · 2026-05-08 04:00

Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

arXiv:2605.05940v1. Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. While on-policy methods mitigate this by leveraging student-generated outputs, they rely on computationally expensive Reinforcement …
arXiv cs.CL TIER_1 English(EN) · Hanting Chen · 2026-05-07 09:50

Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. While on-policy methods mitigate this by leveraging student-generated outputs, they rely on computationally expensive Reinforcement Learning (RL) frameworks. To improve efficiency,…
