PulseAugur

New self-distillation methods improve LLM reasoning and training stability

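Two new papers explore advanced self-distillation techniques for large language models (LLMs), aiming to improve reasoning ability and efficiency. The first paper argues that the power distribution bridges sampling, self-reward reinforcement learning (RL), and self-distillation: it shows that the power distribution can optimize KL-regularized RL and enables a new form of offline distillation. The second paper proposes Preference-Based Self-Distillation (PBSD), which goes beyond simple KL matching by using a reward-regularized objective to optimize the preference gap, improving training stability and performance on reasoning and tool-use benchmarks.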
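The link between the power distribution and KL-regularized RL can be checked numerically on a toy example. The sketch below is not taken from either paper: it assumes a self-reward equal to the base model's own log-probability, r(x) = log p(x), and a hypothetical three-outcome distribution `p`. Under that assumption, the standard closed-form maximizer of E_q[r] - beta * KL(q || p), which is proportional to p(x) * exp(r(x) / beta), reduces to the power distribution p(x)^(1 + 1/beta).

```python
import math

def power_distribution(p, alpha):
    """Return q with q[i] proportional to p[i] ** alpha (sharpens p when alpha > 1)."""
    weights = [pi ** alpha for pi in p]
    z = sum(weights)
    return [w / z for w in weights]

def kl_regularized_objective(q, p, beta):
    """E_q[log p] - beta * KL(q || p).

    Using log p as the reward is an illustrative self-reward assumption,
    not a formulation confirmed by the summarized papers.
    """
    reward = sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0)
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)
    return reward - beta * kl

# Hypothetical base distribution over three outcomes.
p = [0.5, 0.3, 0.2]
beta = 0.5
alpha = 1.0 + 1.0 / beta  # exponent implied by the closed-form solution

q_star = power_distribution(p, alpha)

# Compare the power distribution with a few other candidate policies.
candidates = {
    "base p": p,
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "power (alpha=2)": power_distribution(p, 2.0),
    "power (alpha=1+1/beta)": q_star,
}
for name, q in candidates.items():
    print(f"{name:>24}: {kl_regularized_objective(q, p, beta):.4f}")
```

In this toy setting the policy with exponent 1 + 1/beta scores highest among the candidates, and its objective value equals beta * log(sum of p(x)^alpha), the usual log-partition form. A smaller beta (weaker regularization) gives a larger exponent and a sharper distribution.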

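Impact: These new self-distillation methods could make LLM training more efficient and improve reasoning ability, which may in turn lower inference costs.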

Ranking rationale: Two academic papers published on arXiv introduce new self-distillation methods for large language models.

AI-generated summary · Google Gemini · from 3 sources.

Sources [3]

  1. arXiv cs.LG TIER_1 English(EN) · Akiyoshi Tomihari, Issei Sato ·

    Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

    arXiv:2605.04542v1 Announce Type: new Abstract: Recent analyses question whether reinforcement learning (RL) is responsible for strong reasoning in large language models (LLMs). At the same time, distillation and inference-time sampling, including power sampling, have emerged as …

  2. arXiv cs.LG TIER_1 English(EN) · Xin Yu, Liuchen Liao, Yiwen Zhang, Yingchen Yu, Lingzhou Xue, Qinzhen Guo ·

    Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

    arXiv:2605.05040v1 Announce Type: new Abstract: On-policy distillation is an efficient alternative to reinforcement learning, offering dense token-level training signals. However, its reliance on a stronger external teacher has driven recent work on on-policy self-distillation, w…

  3. arXiv cs.AI TIER_1 English(EN) · Qinzhen Guo ·

    Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

    On-policy distillation is an efficient alternative to reinforcement learning, offering dense token-level training signals. However, its reliance on a stronger external teacher has driven recent work on on-policy self-distillation, where the same model serves as both teacher and s…