New self-distillation methods improve reasoning and training stability in large language models

By the PulseAugur editorial team · [3 sources] · 2026-05-06 15:31

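Two new papers explore advanced self-distillation techniques for large language models (LLMs), aiming to improve reasoning and efficiency. The first argues that the power distribution bridges sampling, self-reward reinforcement learning (RL), and self-distillation: the paper shows that the power distribution optimizes a KL-regularized RL objective and enables a new form of offline distillation. The second proposes Preference-Based Self-Distillation (PBSD), which goes beyond simple KL matching by using a reward-regularized objective that optimizes a preference gap, improving training stability and performance on reasoning and tool-use benchmarks.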
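The link between a power distribution and KL-regularized RL can be illustrated with a textbook identity: the maximizer of E_q[r] - beta * KL(q || p_ref) is q proportional to p_ref * exp(r / beta), so a self-reward r = log p_ref gives q proportional to p_ref^(1 + 1/beta). The sketch below checks this on a toy three-outcome distribution. It is not taken from the paper and may differ from its exact formulation; the distribution, the value of `beta`, and the function names are illustrative assumptions.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def power_distribution(p, alpha):
    """Return q with q_i proportional to p_i ** alpha (alpha > 1 sharpens p)."""
    return normalize([pi ** alpha for pi in p])


def kl_regularized_optimum(p_ref, rewards, beta):
    """Closed-form maximizer of E_q[r] - beta * KL(q || p_ref).

    The solution is q_i proportional to p_ref_i * exp(r_i / beta).
    """
    return normalize([pi * math.exp(r / beta) for pi, r in zip(p_ref, rewards)])


# Toy reference distribution over three outcomes (assumed for illustration).
p_ref = [0.5, 0.3, 0.2]
beta = 0.5

# Self-reward: each outcome is rewarded by its own log-probability.
self_reward = [math.log(pi) for pi in p_ref]

q_rl = kl_regularized_optimum(p_ref, self_reward, beta)
q_pow = power_distribution(p_ref, 1.0 + 1.0 / beta)

for a, b in zip(q_rl, q_pow):
    print(f"{a:.6f}  {b:.6f}")
```

The two printed columns agree, and the most likely outcome gains probability mass relative to `p_ref`, which is the sharpening effect associated with power sampling.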

Impact: These new self-distillation methods could make LLM training more efficient and improve reasoning, potentially lowering inference costs.

Ranking rationale: Two academic papers published on arXiv introduce new self-distillation methods for large language models.


AI-generated summary · Google Gemini · based on 3 sources.

Sources [3]

arXiv cs.LG TIER_1 English(EN) · Akiyoshi Tomihari, Issei Sato · 2026-05-07 04:00

Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

arXiv:2605.04542v1 · Abstract: Recent analyses question whether reinforcement learning (RL) is responsible for strong reasoning in large language models (LLMs). At the same time, distillation and inference-time sampling, including power sampling, have emerged as …

arXiv cs.LG TIER_1 English(EN) · Xin Yu, Liuchen Liao, Yiwen Zhang, Yingchen Yu, Lingzhou Xue, Qinzhen Guo · 2026-05-07 04:00

Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

arXiv:2605.05040v1 · Abstract: On-policy distillation is an efficient alternative to reinforcement learning, offering dense token-level training signals. However, its reliance on a stronger external teacher has driven recent work on on-policy self-distillation, w…

arXiv cs.AI TIER_1 English(EN) · Qinzhen Guo · 2026-05-06 15:31

Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

On-policy distillation is an efficient alternative to reinforcement learning, offering dense token-level training signals. However, its reliance on a stronger external teacher has driven recent work on on-policy self-distillation, where the same model serves as both teacher and s…
