New methods enhance on-policy distillation for LLM training

By PulseAugur Editorial · [6 sources] · 2026-05-22 04:00

Researchers have developed new methods to improve on-policy distillation (OPD), a technique for training smaller language models using larger ones. One approach, TIP, identifies informative tokens by analyzing student entropy and teacher-student divergence, achieving significant memory reduction and performance gains. Another method, SimCT, addresses issues with different tokenizers by expanding the supervision space to include multi-token continuations, recovering lost signal and improving performance on reasoning and code generation tasks. Additionally, EffOPD accelerates OPD training by optimizing update trajectories and module allocation, leading to a threefold speedup.
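To make the two signals named in the TIP description concrete, here is a minimal sketch of scoring token positions by student entropy and teacher-student divergence. It is an illustration under stated assumptions, not the TIP authors' method: the function names, the use of reverse KL (student relative to teacher), the "keep if either signal is high" rule, and the thresholds are all hypothetical choices made for this example.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy (in nats) of a distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def token_scores(student_logits, teacher_logits):
    """Return (student entropy, KL(student || teacher)) for each position.

    Assumption: both models share a vocabulary, so logits align index by index.
    """
    scores = []
    for s_logits, t_logits in zip(student_logits, teacher_logits):
        p_student = softmax(s_logits)
        p_teacher = softmax(t_logits)
        scores.append((entropy(p_student), kl_divergence(p_student, p_teacher)))
    return scores

def select_tokens(scores, entropy_min, divergence_min):
    """Hypothetical rule: keep positions where the student is uncertain
    or where it disagrees with the teacher."""
    return [i for i, (h, d) in enumerate(scores)
            if h >= entropy_min or d >= divergence_min]

# Toy rollout of three positions over a three-token vocabulary.
student = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
teacher = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
selected = select_tokens(token_scores(student, teacher), 0.5, 0.5)
```

In this toy rollout, position 0 is dropped because the student is confident and agrees with the teacher, position 1 is kept because the student is uncertain, and position 2 is kept because the student is confident but disagrees with the teacher. Restricting the loss to such positions is one plausible route to the kind of memory savings the summary mentions, though the paper's actual selection criterion may differ.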

IMPACT These research advancements offer more efficient and effective ways to train smaller language models, potentially reducing computational costs and improving performance on complex reasoning tasks.

RANK_REASON The cluster contains multiple academic papers detailing new methods and theoretical insights into on-policy distillation for large language models.


AI-generated summary · Google Gemini · from 6 sources.

COVERAGE [6]

arXiv cs.AI TIER_1 · Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Brian King, Vincent Lu, Jack FitzGerald · 2026-05-25 04:00

EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

arXiv:2605.23493v1 Announce Type: new Abstract: On-Policy Distillation (OPD) has gained wide attraction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift, and consequently, regression in general tasks.…

arXiv cs.AI TIER_1 · Jack FitzGerald · 2026-05-22 10:55

EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

On-Policy Distillation (OPD) has gained wide attraction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift, and consequently, regression in general tasks. On-Policy Self-Distillation (OPSD) is an effici…

arXiv cs.AI TIER_1 · Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard · 2026-05-22 04:00

TIP: Token Importance in On-Policy Distillation

arXiv:2604.14084v4 Announce Type: replace-cross Abstract: On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We as…

arXiv cs.CL TIER_1 · Jie Sun, Mao Zheng, Mingyang Song, Qiyong Zhong, Yilin Cheng, Bichuan Feng, Pengfei Liu, Junfeng Fang, Xiang Wang · 2026-05-22 04:00

SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

arXiv:2605.07711v2 Announce Type: replace Abstract: On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whene…

arXiv cs.CL TIER_1 · Yuchen Cai, Ding Cao, Liang Lin, Chunxi Luo, Xin Xu, Kai Yang, Weijie Liu, Saiyong Yang, Tianxiang Zhao, Guangzhong Sun, Guiquan Liu, Junfeng Fang · 2026-05-22 04:00

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

arXiv:2605.11739v3 Announce Type: replace Abstract: On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-lev…

arXiv cs.LG TIER_1 · Xiaogeng Liu, Xinyan Wang, Yingzi Ma, Yechao Zhang, Chaowei Xiao · 2026-05-22 04:00

When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

arXiv:2605.21606v1 Announce Type: new Abstract: On-policy self-distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliab…
