PulseAugur

New methods improve on-policy distillation for LLM training

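Researchers have developed new methods to improve on-policy distillation (OPD), a technique in which a large teacher model supervises a smaller student language model on the student's own rollouts. One method, TIP, identifies informative tokens by analyzing student entropy and teacher-student divergence, and reports significant memory savings along with performance gains. Another, SimCT, addresses the case where teacher and student use different tokenizers: it extends the supervision space to include multi-token continuations, recovering the signal that token-by-token comparison loses and improving performance on reasoning and code-generation tasks. In addition, EffOPD accelerates OPD training by optimizing the update trajectory and module allocation, reaching a threefold speedup.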
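The token-selection idea attributed to TIP can be illustrated with a small sketch. This is not the paper's algorithm: the summary only says that student entropy and teacher-student divergence are used to find informative tokens, so the function name `select_tokens`, the reverse-KL divergence, the thresholds, and the "either signal is high" rule below are all assumptions made for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's vocabulary logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    # Shannon entropy in nats.
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q); q is clamped to avoid division by zero.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def select_tokens(student_logits, teacher_logits,
                  entropy_min=0.5, divergence_min=0.1):
    """Hypothetical selection rule (an assumption, not TIP itself):
    keep a position if the student is uncertain there, or if the
    student distribution diverges from the teacher's."""
    selected = []
    for i, (s, t) in enumerate(zip(student_logits, teacher_logits)):
        p_student, p_teacher = softmax(s), softmax(t)
        uncertain = entropy(p_student) >= entropy_min
        disagrees = kl_divergence(p_student, p_teacher) >= divergence_min
        if uncertain or disagrees:
            selected.append(i)
    return selected

# Three positions over a toy 3-word vocabulary:
#   0: student confident and agrees with the teacher
#   1: student uncertain (uniform distribution)
#   2: student confident but the teacher prefers another token
student = [[10.0, 0.0, 0.0], [0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
teacher = [[10.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 10.0, 0.0]]
print(select_tokens(student, teacher))
```

Under these assumptions, only the selected positions would contribute to the distillation loss, which is one plausible way a token-importance filter could reduce memory use during training.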

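Impact: These advances offer more effective and more efficient ways to train small language models, and could lower compute costs while improving performance on complex reasoning tasks.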

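Ranking rationale: This cluster comprises several academic papers that detail new methods and theoretical insights for on-policy distillation of large language models.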


AI-generated summary · Google Gemini · based on 6 sources.

Sources [6]

  1. arXiv cs.AI TIER_1 English(EN) · Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Brian King, Vincent Lu, Jack FitzGerald ·

    EDGE-OPD: Internalizing Privileged Context via Evidence-Guided On-Policy Distillation

    arXiv:2605.23493v1 Announce Type: new Abstract: On-Policy Distillation (OPD) has gained wide attraction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift, and consequently, regression in general tasks.…

  2. arXiv cs.AI TIER_1 English(EN) · Jack FitzGerald ·

    EDGE-OPD: Internalizing Privileged Context via Evidence-Guided On-Policy Distillation

    On-Policy Distillation (OPD) has gained wide attraction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift, and consequently, regression in general tasks. On-Policy Self-Distillation (OPSD) is an effici…

  3. arXiv cs.AI TIER_1 English(EN) · Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard ·

    TIP: Token Importance in On-Policy Distillation

    arXiv:2604.14084v4 Announce Type: replace-cross Abstract: On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We as…

  4. arXiv cs.CL TIER_1 English(EN) · Jie Sun, Mao Zheng, Mingyang Song, Qiyong Zhong, Yilin Cheng, Bichuan Feng, Pengfei Liu, Junfeng Fang, Xiang Wang ·

    SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

    arXiv:2605.07711v2 Announce Type: replace Abstract: On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whene…

  5. arXiv cs.CL TIER_1 English(EN) · Yuchen Cai, Ding Cao, Liang Lin, Chunxi Luo, Xin Xu, Kai Yang, Weijie Liu, Saiyong Yang, Tianxiang Zhao, Guangzhong Sun, Guiquan Liu, Junfeng Fang ·

    Learning to Foresee: Revealing What Unlocks the Efficiency of On-Policy Distillation

    arXiv:2605.11739v3 Announce Type: replace Abstract: On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-lev…

  6. arXiv cs.LG TIER_1 English(EN) · Xiaogeng Liu, Xinyan Wang, Yingzi Ma, Yechao Zhang, Chaowei Xiao ·

    When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

    arXiv:2605.21606v1 Announce Type: new Abstract: On-policy self-distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliab…