New methods enhance on-policy distillation for LLM training

By the PulseAugur editorial team · [6 sources] · 2026-05-22 04:00

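Researchers have developed new methods that improve on-policy distillation (OPD), a technique in which a large teacher model supervises the training of a smaller student language model. One method, TIP, identifies informative tokens by analyzing student entropy and teacher-student disagreement, yielding significant memory reductions and performance gains. Another, SimCT, addresses mismatched tokenizers by expanding the supervision space to include multi-token continuations, which recovers lost signal and improves performance on reasoning and code-generation tasks. In addition, EffOPD accelerates OPD training by optimizing update trajectories and module allocation, achieving a threefold speedup.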
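To make the idea of token selection concrete, here is a minimal sketch of ranking token positions by student entropy and teacher-student divergence. It is an illustration only, not the TIP paper's actual rule: the additive score, the use of KL(student || teacher), the `keep_fraction` parameter, and the function name `select_informative_tokens` are all assumptions introduced here.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; q is clamped to eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def select_informative_tokens(
    student_logits: Sequence[Sequence[float]],
    teacher_logits: Sequence[Sequence[float]],
    keep_fraction: float = 0.5,
) -> List[int]:
    """Return indices of the highest-scoring positions, in ascending order.

    Hypothetical score: student entropy + KL(student || teacher).
    This combination is an assumption, not a published formula.
    """
    scores = []
    for s_logits, t_logits in zip(student_logits, teacher_logits):
        p_s = softmax(s_logits)
        p_t = softmax(t_logits)
        scores.append(entropy(p_s) + kl_divergence(p_s, p_t))
    k = max(1, math.ceil(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Three positions over a three-token vocabulary:
# 0: student confident and teacher agrees (low score)
# 1: student uncertain (high entropy)
# 2: student confident but teacher disagrees (high divergence)
student = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
teacher = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
kept = select_informative_tokens(student, teacher, keep_fraction=0.5)
print(kept)
```

In this toy setting, computing the distillation loss only at the kept positions would skip the position where the student is already confident and agrees with the teacher, which is one plausible route to the kind of memory savings the summary describes.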

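Impact: These advances offer more effective and efficient ways to train small language models, which may lower compute costs and improve performance on complex reasoning tasks.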

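Ranking rationale: This cluster comprises multiple academic papers detailing new methods and theoretical insights for on-policy distillation of large language models.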


AI-generated summary · Google Gemini · based on 6 sources.

Sources [6]

arXiv cs.AI TIER_1 English(EN) · Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Brian King, Vincent Lu, Jack FitzGerald · 2026-05-25 04:00

EDGE-OPD: Internalizing Privileged Context via Evidence-Guided On-Policy Distillation

arXiv:2605.23493v1 Announce Type: new Abstract: On-Policy Distillation (OPD) has gained wide attraction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift, and consequently, regression in general tasks.…
arXiv cs.AI TIER_1 English(EN) · Jack FitzGerald · 2026-05-22 10:55

EDGE-OPD: Internalizing Privileged Context via Evidence-Guided On-Policy Distillation

On-Policy Distillation (OPD) has gained wide attraction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift, and consequently, regression in general tasks. On-Policy Self-Distillation (OPSD) is an effici…
arXiv cs.AI TIER_1 English(EN) · Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard · 2026-05-22 04:00

TIP: Token Importance in On-Policy Distillation

arXiv:2604.14084v4 Announce Type: replace-cross Abstract: On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We as…
arXiv cs.CL TIER_1 English(EN) · Jie Sun, Mao Zheng, Mingyang Song, Qiyong Zhong, Yilin Cheng, Bichuan Feng, Pengfei Liu, Junfeng Fang, Xiang Wang · 2026-05-22 04:00

SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

arXiv:2605.07711v2 Announce Type: replace Abstract: On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whene…
arXiv cs.CL TIER_1 English(EN) · Yuchen Cai, Ding Cao, Liang Lin, Chunxi Luo, Xin Xu, Kai Yang, Weijie Liu, Saiyong Yang, Tianxiang Zhao, Guangzhong Sun, Guiquan Liu, Junfeng Fang · 2026-05-22 04:00

Learning to Foresee: Unveiling What Unlocks the Efficiency of On-Policy Distillation

arXiv:2605.11739v3 Announce Type: replace Abstract: On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-lev…
arXiv cs.LG TIER_1 English(EN) · Xiaogeng Liu, Xinyan Wang, Yingzi Ma, Yechao Zhang, Chaowei Xiao · 2026-05-22 04:00

When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

arXiv:2605.21606v1 Announce Type: new Abstract: On-policy self-distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliab…
