PulseAugur

New methods enhance on-policy distillation for LLM training

Several new papers propose improvements to on-policy distillation (OPD), in which a smaller student language model is trained on its own rollouts under token-level supervision from a larger teacher. One approach, TIP, identifies informative tokens from student entropy and teacher-student divergence, and reports significant memory reduction along with performance gains. Another, SimCT, targets teacher-student pairs that use different tokenizers: it expands the supervision space to include multi-token continuations, recovering signal that would otherwise be lost and improving performance on reasoning and code generation tasks. A third, EffOPD, accelerates OPD training by optimizing update trajectories and module allocation, yielding a threefold speedup.
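To make the token-selection idea concrete, here is a minimal sketch of scoring positions by student entropy and teacher-student divergence. It is not the TIP authors' implementation: the function names, the use of KL(student || teacher) as the divergence, the additive scoring rule, and the `keep_fraction` parameter are all assumptions chosen for illustration.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def token_scores(student_logits, teacher_logits):
    """Return per-position student entropy and KL(student || teacher).

    Both inputs have shape (num_positions, vocab_size) and are assumed to
    share one vocabulary (the same-tokenizer case).
    """
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    s_p = np.exp(s_logp)
    entropy = -(s_p * s_logp).sum(axis=-1)
    divergence = (s_p * (s_logp - t_logp)).sum(axis=-1)
    return entropy, divergence


def select_tokens(student_logits, teacher_logits, keep_fraction=0.5):
    """Pick the positions with the highest combined score.

    Hypothetical rule: score = entropy + divergence. The paper's actual
    criterion may combine the two signals differently.
    """
    entropy, divergence = token_scores(student_logits, teacher_logits)
    score = entropy + divergence
    k = max(1, int(round(keep_fraction * score.shape[0])))
    keep = np.argsort(-score)[:k]
    return np.sort(keep)


# Toy rollout: 4 positions over a 3-token vocabulary.
student = np.array([[10.0, 0.0, 0.0],   # confident, agrees with teacher
                    [0.0, 0.0, 0.0],    # uncertain student
                    [10.0, 0.0, 0.0],   # confident, disagrees with teacher
                    [0.0, 10.0, 0.0]])  # confident, agrees with teacher
teacher = np.array([[10.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0],
                    [0.0, 10.0, 0.0],
                    [0.0, 10.0, 0.0]])
selected = select_tokens(student, teacher, keep_fraction=0.5)
```

In this toy case the selection keeps the position where the student is uncertain and the position where it confidently disagrees with the teacher. Computing the distillation loss only at the selected positions is one way such a filter could reduce memory use, though the savings reported for TIP depend on details not covered in this summary.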

IMPACT These research advancements offer more efficient and effective ways to train smaller language models, potentially reducing computational costs and improving performance on complex reasoning tasks.

RANK_REASON The cluster contains multiple academic papers detailing new methods and theoretical insights into on-policy distillation for large language models.


AI-generated summary · Google Gemini · from 6 sources.

COVERAGE [6]

  1. arXiv cs.AI TIER_1 · Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Brian King, Vincent Lu, Jack FitzGerald

    EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

    arXiv:2605.23493v1 · On-Policy Distillation (OPD) has gained wide traction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift, and consequently, regression in general tasks.…

  2. arXiv cs.AI TIER_1 · Jack FitzGerald

    EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

    On-Policy Distillation (OPD) has gained wide traction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift, and consequently, regression in general tasks. On-Policy Self-Distillation (OPSD) is an effici…

  3. arXiv cs.AI TIER_1 · Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard

    TIP: Token Importance in On-Policy Distillation

    arXiv:2604.14084v4 · On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We as…

  4. arXiv cs.CL TIER_1 · Jie Sun, Mao Zheng, Mingyang Song, Qiyong Zhong, Yilin Cheng, Bichuan Feng, Pengfei Liu, Junfeng Fang, Xiang Wang

    SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

    arXiv:2605.07711v2 · On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whene…

  5. arXiv cs.CL TIER_1 · Yuchen Cai, Ding Cao, Liang Lin, Chunxi Luo, Xin Xu, Kai Yang, Weijie Liu, Saiyong Yang, Tianxiang Zhao, Guangzhong Sun, Guiquan Liu, Junfeng Fang

    Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

    arXiv:2605.11739v3 · On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-lev…

  6. arXiv cs.LG TIER_1 · Xiaogeng Liu, Xinyan Wang, Yingzi Ma, Yechao Zhang, Chaowei Xiao

    When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

    arXiv:2605.21606v1 · On-policy self-distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliab…