PulseAugur

New methods enhance on-policy distillation for LLM training

Researchers have developed new methods to improve on-policy distillation (OPD), a technique in which a smaller student language model is trained on its own rollouts under supervision from a larger teacher. One approach, TIP, identifies informative tokens by analyzing student entropy and teacher-student divergence, and reports significant memory reduction alongside performance gains. Another method, SimCT, targets the case where teacher and student use different tokenizers: it expands the supervision space to include multi-token continuations, recovering supervision signal that is otherwise lost and improving performance on reasoning and code generation tasks. Additionally, EffOPD accelerates OPD training by optimizing update trajectories and module allocation, leading to a threefold speedup.
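To make the token-selection idea concrete, here is a minimal sketch in plain Python. It is not the TIP algorithm: the scoring rule (student entropy plus reverse KL divergence), the `keep_fraction` parameter, and all function names are illustrative assumptions, and the distributions are toy lists over a shared three-word vocabulary.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability distribution given as a list."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def reverse_kl(student, teacher, eps=1e-12):
    """KL(student || teacher) over a shared vocabulary; eps guards against log(0)."""
    return sum(
        s * math.log(s / max(t, eps))
        for s, t in zip(student, teacher)
        if s > 0.0
    )

def select_tokens(student_dists, teacher_dists, keep_fraction=0.5):
    """Return the sorted positions of the highest-scoring tokens.

    Assumption: score = student entropy + KL(student || teacher).
    This additive rule is hypothetical and only illustrates combining
    the two signals named in the summary.
    """
    scores = [
        entropy(s) + reverse_kl(s, t)
        for s, t in zip(student_dists, teacher_dists)
    ]
    k = max(1, math.ceil(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def masked_distillation_loss(student_dists, teacher_dists, positions):
    """Mean reverse KL computed over the selected positions only."""
    total = sum(reverse_kl(student_dists[i], teacher_dists[i]) for i in positions)
    return total / len(positions)
```

In this sketch, positions where the student is confident and already agrees with the teacher score low and are dropped, so the loss is computed on fewer tokens. How TIP actually scores and selects tokens is described in the paper, not here.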

Impact: These research advances offer more efficient and effective ways to train smaller language models, potentially reducing computational costs and improving performance on complex reasoning tasks.

Ranking rationale: The cluster contains multiple academic papers detailing new methods and theoretical insights into on-policy distillation for large language models.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 6 sources. How we write summaries →

Sources [6]

  1. arXiv cs.AI TIER_1 English(EN) · Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Brian King, Vincent Lu, Jack FitzGerald ·

    EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

    arXiv:2605.23493v1. Abstract: On-Policy Distillation (OPD) has gained wide attraction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift, and consequently, regression in general tasks.…

  2. arXiv cs.AI TIER_1 English(EN) · Jack FitzGerald ·

    EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

    On-Policy Distillation (OPD) has gained wide attraction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift, and consequently, regression in general tasks. On-Policy Self-Distillation (OPSD) is an effici…

  3. arXiv cs.AI TIER_1 English(EN) · Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard ·

    TIP: Token Importance in On-Policy Distillation

    arXiv:2604.14084v4. Abstract: On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We as…

  4. arXiv cs.CL TIER_1 English(EN) · Jie Sun, Mao Zheng, Mingyang Song, Qiyong Zhong, Yilin Cheng, Bichuan Feng, Pengfei Liu, Junfeng Fang, Xiang Wang ·

    SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

    arXiv:2605.07711v2. Abstract: On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whene…

  5. arXiv cs.CL TIER_1 English(EN) · Yuchen Cai, Ding Cao, Liang Lin, Chunxi Luo, Xin Xu, Kai Yang, Weijie Liu, Saiyong Yang, Tianxiang Zhao, Guangzhong Sun, Guiquan Liu, Junfeng Fang ·

    Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

    arXiv:2605.11739v3. Abstract: On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-lev…

  6. arXiv cs.LG TIER_1 English(EN) · Xiaogeng Liu, Xinyan Wang, Yingzi Ma, Yechao Zhang, Chaowei Xiao ·

    When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

    arXiv:2605.21606v1. Abstract: On-policy self-distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliab…