Researchers are exploring advanced techniques in on-policy distillation (OPD) for large language models to improve training stability and efficiency. Several papers introduce methods to refine how teacher models guide student models, focusing on selective learning, adaptive weighting, and better credit assignment. These approaches aim to overcome challenges like state-oblivious collapse, unreliable supervision signals, and the optimization of
AI
arXiv:2606.09456v1 Announce Type: new Abstract: On-Policy Distillation (OPD) has become a core technique in the post-training of Large Language Models (LLMs) for transferring knowledge from domain experts to student models. However, existing OPD distillation methods require teach…
arXiv:2606.09304v1 Announce Type: cross Abstract: On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find th…
arXiv:2606.09471v1 Announce Type: new Abstract: On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degra…
arXiv:2606.09091v1 Announce Type: new Abstract: On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinf…
On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degraded state, producing low reverse KL but little c…
On-Policy Distillation (OPD) has become a core technique in the post-training of Large Language Models (LLMs) for transferring knowledge from domain experts to student models. However, existing OPD distillation methods require teacher and student models to share the same tokenize…
On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assu…
arXiv cs.AI
TIER_1English(EN)·Shizhe Xiang, Ke An, Wenlong Yu, Yue Liu, Jian Luan, Pei Fu, Qilong Wang·
arXiv:2606.07000v1 Announce Type: new Abstract: Recent post-training methods, particularly Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced the reasoning ability of Large Vision-Language Models (LVLMs). However, the sparse nature of verifiable re…
arXiv cs.AI
TIER_1English(EN)·Zhennan Shen, Yanshu Li, Qingyu Yin, Chak Tou Leong, Zhilin Wang, Yanxu Chen, Rongduo Han, Sunbowen Lee, Yi R. Fung·
arXiv:2606.07082v1 Announce Type: cross Abstract: On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with …
On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement lea…
Recent post-training methods, particularly Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced the reasoning ability of Large Vision-Language Models (LVLMs). However, the sparse nature of verifiable rewards provides little token-level supervision fo…
arXiv:2606.06021v1 Announce Type: new Abstract: On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.…
arXiv:2606.05718v1 Announce Type: cross Abstract: On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that ob…
On-policy distillation exhibits distinct parameter space dynamics characterized by relaxed off-principal updates and subspace locking, forming a unique geometric pattern separate from supervised fine-tuning and reinforcement learning with verifiable rewards.
On-Policy Representation Distillation (OPRD) improves upon traditional on-policy distillation by aligning student and teacher representations in hidden-state space rather than just output space, resulting in reduced variance and improved training efficiency.
arXiv:2606.02684v1 Announce Type: cross Abstract: On-Policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which …
arXiv:2606.03532v1 Announce Type: cross Abstract: Self on-policy distillation trains a student policy against a teacher derived from its own parameter history, yet the teacher's update schedule -- which governs the \emph{temporal coupling} between teacher and student -- has not b…
Self on-policy distillation trains a student policy against a teacher derived from its own parameter history, yet the teacher's update schedule -- which governs the \emph{temporal coupling} between teacher and student -- has not been systematically studied as a stability variable…
arXiv:2604.10688v2 Announce Type: replace-cross Abstract: On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy …
arXiv:2605.09253v2 Announce Type: replace-cross Abstract: While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Dis…
arXiv:2605.12400v2 Announce Type: replace-cross Abstract: We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite its promise, OPSD can suffer f…
arXiv:2605.12652v2 Announce Type: replace-cross Abstract: Large language models are often post-trained with sparse verifier rewards, which indicate whether a sampled trajectory succeeds but provide limited guidance about where reasoning succeeds or fails. On-policy distillation (…
arXiv:2606.01249v1 Announce Type: cross Abstract: On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training b…
arXiv:2606.01476v1 Announce Type: cross Abstract: On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and…
arXiv:2606.02530v1 Announce Type: new Abstract: Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing methods mitigate this by balancing dual objectives, which heavily rely on massive general-purpose …
arXiv cs.AI
TIER_1English(EN)·Hanyang Zhao, Haoxian Chen, Han Lin, Genta Indra Winata, David Yao, Wenpin Tang·
arXiv:2606.01039v1 Announce Type: cross Abstract: On-policy distillation (OPD) is a widely used technique to transfer capabilities from capable teacher language models to the base student models, and can be formulated in a reinforcement learning style objective using student gene…
arXiv cs.AI
TIER_1English(EN)·Yuxuan Jiang, Francis Ferraro·
arXiv:2606.00305v1 Announce Type: cross Abstract: On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own policy under teacher supervision. Although OPD operates on trajectories, its learning signal rem…
Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing methods mitigate this by balancing dual objectives, which heavily rely on massive general-purpose data or auxiliary reward models. In this paper, …
arXiv:2605.31159v1 Announce Type: cross Abstract: On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, pla…
arXiv cs.AI
TIER_1English(EN)·Yanjiang Liu, Jie Lou, Xinyan Guan, Yuqiu Ji, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Xing Yu, Yaojie Lu·
arXiv:2605.30833v1 Announce Type: cross Abstract: On-policy distillation transfers reasoning capabilities by training a student model on its own generated trajectories using token-level feedback from a teacher. However, we identify a critical bottleneck, \textbf{Supervision Fidel…
FiRe-OPD improves on-policy distillation in large language models by filtering low-quality trajectories and applying soft reweighting to enhance informative token selection and optimization stability.
OmniOPD addresses limitations of standard On-Policy Distillation by using chunk-level semantic similarity instead of token-level logits, improving learning reliability and performance with black-box teachers.
Trust Region On-Policy Distillation (TrOPD) improves reliable token-level supervision in large language model distillation by using trust regions, outlier estimation, and off-policy guidance to address instability issues under distribution mismatch.
On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality pr…
arXiv:2605.30070v1 Announce Type: cross Abstract: Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. On-policy self-distillation (OPSD) is a promising recent approach that uses arbitrary feedback as learning signa…
arXiv:2605.29343v1 Announce Type: new Abstract: Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, like EAGLE3 or DFlash is su…
Trust-Region behavior Blending improves on-policy distillation by replacing early poor-quality student rollouts with teacher-like behavior within a KL trust region during warmup.
Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. On-policy self-distillation (OPSD) is a promising recent approach that uses arbitrary feedback as learning signal, yet its reliability compared to established met…
arXiv:2605.28396v1 Announce Type: cross Abstract: On-policy distillation (OPD) transfers reasoning behavior by training a student on teacher feedback along student-generated trajectories, but standard full-rollout training ties every update to a costly completion and can over-all…
On-policy distillation (OPD) transfers reasoning behavior by training a student on teacher feedback along student-generated trajectories, but standard full-rollout training ties every update to a costly completion and can over-allocate supervision to late positions with low margi…
arXiv:2605.27028v1 Announce Type: cross Abstract: On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollouts with a teacher model. However, we observe ``Off-policy Teacher Decay'' pro…
On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollouts with a teacher model. However, we observe ``Off-policy Teacher Decay'' problem in this paradigm: for the later tokens, with …
On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollouts with a teacher model. However, we observe ``Off-policy Teacher Decay'' problem in this paradigm: for the later tokens, with …
arXiv cs.AI
TIER_1English(EN)·Siqi Zhu, Xuyan Ye, Hongyu Lu, Weiye Shi, Ge Liu·
arXiv:2605.11182v2 Announce Type: replace Abstract: On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model's own p…
Token-level teacher signals in on-policy distillation are better predicted by teachability—measuring local compatibility between teacher and student distributions—than by raw KL disagreement alone.
On-policy distillation suffers from teacher decay issues with later tokens, which are mitigated by Early Stopping Rollout that restricts training to initial response tokens, improving efficiency and stability.
On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as referenc…
<table> <tr><td> <a href="https://www.reddit.com/r/MachineLearning/comments/1twmhud/onpolicy_distillation_one_of_the_hottest_terms_on/"> <img alt="On-policy distillation: one of the hottest terms on PapersWithCode [R]" src="https://preview.redd.it/yegq2gfag95h1.png?width=140&…