PulseAugur
research · [4 sources] ·

New methods enhance on-policy distillation for LLM training

Researchers have developed new methods to improve on-policy distillation (OPD), a technique in which a smaller student language model is trained on its own rollouts under supervision from a larger teacher. One approach, TIP, identifies informative tokens by analyzing student entropy and teacher-student divergence, achieving significant memory reduction and performance gains. Another method, SimCT, addresses the case where teacher and student use different tokenizers by expanding the supervision space to include multi-token continuations, recovering lost signal and improving performance on reasoning and code generation tasks. Additionally, EffOPD accelerates OPD training by optimizing update trajectories and module allocation, leading to a threefold speedup.

Summary written by gemini-2.5-flash-lite from 4 sources.
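
The two token-level quantities the summary attributes to TIP, student entropy and teacher-student divergence, can be illustrated with a small sketch. This is not the TIP algorithm: the function names, the thresholds, the "either signal" selection rule, and the choice of KL(student || teacher) as the divergence are all assumptions made here for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one vocabulary-sized row of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q). The direction (student || teacher) is an assumption.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def token_signals(student_logits, teacher_logits):
    # One (student entropy, student-teacher divergence) pair per position.
    signals = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s, t = softmax(s_row), softmax(t_row)
        signals.append((entropy(s), kl_divergence(s, t)))
    return signals

def select_tokens(signals, entropy_min, divergence_min):
    # Hypothetical rule: keep a position if either signal is large.
    return [i for i, (h, d) in enumerate(signals)
            if h >= entropy_min or d >= divergence_min]

# Toy rollout of three positions over a three-token vocabulary.
student = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
teacher = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
signals = token_signals(student, teacher)
selected = select_tokens(signals, entropy_min=0.5, divergence_min=1.0)
print(selected)
```

In this toy input, position 0 is kept because the student is uncertain, position 2 because a confident student disagrees with the teacher, and position 1 is dropped because the student is confident and already matches the teacher.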

IMPACT These research advancements offer more efficient and effective ways to train smaller language models, potentially reducing computational costs and improving performance on complex reasoning tasks.

RANK_REASON The cluster contains multiple academic papers detailing new methods and theoretical insights into on-policy distillation for large language models.


COVERAGE [4]

  1. arXiv cs.AI TIER_1 · Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard ·

    TIP: Token Importance in On-Policy Distillation

    arXiv:2604.14084v4. Abstract: On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We as…

  2. arXiv cs.CL TIER_1 · Jie Sun, Mao Zheng, Mingyang Song, Qiyong Zhong, Yilin Cheng, Bichuan Feng, Pengfei Liu, Junfeng Fang, Xiang Wang ·

    SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

    arXiv:2605.07711v2. Abstract: On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whene…

  3. arXiv cs.CL TIER_1 · Yuchen Cai, Ding Cao, Liang Lin, Chunxi Luo, Xin Xu, Kai Yang, Weijie Liu, Saiyong Yang, Tianxiang Zhao, Guangzhong Sun, Guiquan Liu, Junfeng Fang ·

    Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

    arXiv:2605.11739v3. Abstract: On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-lev…

  4. arXiv cs.LG TIER_1 · Xiaogeng Liu, Xinyan Wang, Yingzi Ma, Yechao Zhang, Chaowei Xiao ·

    When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

    arXiv:2605.21606v1. Abstract: On-policy self-distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliab…