PulseAugur

Trust Region On-Policy Distillation

Researchers are exploring advanced techniques in on-policy distillation (OPD) for large language models to improve training stability and efficiency. Several papers introduce methods to refine how teacher models guide student models, focusing on selective learning, adaptive weighting, and better credit assignment. These approaches aim to overcome challenges like state-oblivious collapse, unreliable supervision signals, and the optimization of AI…

AI-generated summary · Google Gemini · from 48 sources.

COVERAGE [48]

  1. arXiv cs.LG TIER_1 English(EN) · Yifan Niu, Han Xiao, Dongyi Liu, Zelong Wang, Dihong Gong, Yasheng Wang, Jia Li ·

    Breaking the Tokenizer Barrier: On-Policy Distillation across Model Families

    arXiv:2606.09456v1 Announce Type: new Abstract: On-Policy Distillation (OPD) has become a core technique in the post-training of Large Language Models (LLMs) for transferring knowledge from domain experts to student models. However, existing OPD distillation methods require teach…

  2. arXiv cs.LG TIER_1 English(EN) · Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Xiaofeng Zhang, Xiaosong Yuan ·

    SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling

    arXiv:2606.09304v1 Announce Type: cross Abstract: On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find th…

  3. arXiv cs.LG TIER_1 English(EN) · Haoran Xin, Anhao Zhao, Ying Sun, Jin Li, Xiaoyu Shen, Hui Xiong ·

    Escaping the KL Agreement Trap in On-Policy Distillation

    arXiv:2606.09471v1 Announce Type: new Abstract: On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degra…

  4. arXiv cs.LG TIER_1 English(EN) · Dongze Hao, Zhiwei Jin, Chen Chen, Haonan Lu ·

    Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization

    arXiv:2606.09091v1 Announce Type: new Abstract: On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinf…

  5. arXiv cs.CL TIER_1 English(EN) · Hui Xiong ·

    Escaping the KL Agreement Trap in On-Policy Distillation

    On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degraded state, producing low reverse KL but little c…

  6. arXiv cs.LG TIER_1 English(EN) · Jia Li ·

    Breaking the Tokenizer Barrier: On-Policy Distillation across Model Families

    On-Policy Distillation (OPD) has become a core technique in the post-training of Large Language Models (LLMs) for transferring knowledge from domain experts to student models. However, existing OPD distillation methods require teacher and student models to share the same tokenize…

  7. arXiv cs.CL TIER_1 English(EN) · Xiaosong Yuan ·

    SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling

    On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assu…

  8. arXiv cs.AI TIER_1 English(EN) · Shizhe Xiang, Ke An, Wenlong Yu, Yue Liu, Jian Luan, Pei Fu, Qilong Wang ·

    Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

    arXiv:2606.07000v1 Announce Type: new Abstract: Recent post-training methods, particularly Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced the reasoning ability of Large Vision-Language Models (LVLMs). However, the sparse nature of verifiable re…

  9. arXiv cs.AI TIER_1 English(EN) · Zhennan Shen, Yanshu Li, Qingyu Yin, Chak Tou Leong, Zhilin Wang, Yanxu Chen, Rongduo Han, Sunbowen Lee, Yi R. Fung ·

    On the Geometry of On-Policy Distillation

    arXiv:2606.07082v1 Announce Type: cross Abstract: On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with …

  10. arXiv cs.LG TIER_1 English(EN) · Yi R. Fung ·

    On the Geometry of On-Policy Distillation

    On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement lea…

  11. arXiv cs.AI TIER_1 English(EN) · Qilong Wang ·

    Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

    Recent post-training methods, particularly Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced the reasoning ability of Large Vision-Language Models (LVLMs). However, the sparse nature of verifiable rewards provides little token-level supervision fo…

  12. arXiv cs.LG TIER_1 English(EN) · Shenzhi Yang, Guangcheng Zhu, Bowen Song, Haobo Wang, Mingxuan Xia, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen ·

    OPRD: On-Policy Representation Distillation

    arXiv:2606.06021v1 Announce Type: new Abstract: On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.…

  13. arXiv cs.LG TIER_1 English(EN) · Kanghui Tian, Siyuan Liu, Ziang Yan, Sheng Xia, Shuai Dong, Yi Wang ·

    ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

    arXiv:2606.05718v1 Announce Type: cross Abstract: On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that ob…

  14. Hugging Face Daily Papers TIER_1 English(EN) ·

    On the Geometry of On-Policy Distillation

    On-policy distillation exhibits distinct parameter space dynamics characterized by relaxed off-principal updates and subspace locking, forming a unique geometric pattern separate from supervised fine-tuning and reinforcement learning with verifiable rewards.

  15. Hugging Face Daily Papers TIER_1 English(EN) ·

    OPRD: On-Policy Representation Distillation

    On-Policy Representation Distillation (OPRD) improves upon traditional on-policy distillation by aligning student and teacher representations in hidden-state space rather than just output space, resulting in reduced variance and improved training efficiency.

  16. arXiv cs.AI TIER_1 English(EN) · Yuying Li, Leqi Zheng, Yongzi Yu, Wenrui Zhou, Xuchang Zhong, Xing Hu, Jing Jin, Huangjie Yuan, Tao Feng ·

    Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

    arXiv:2606.02684v1 Announce Type: cross Abstract: On-Policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which …

  17. arXiv cs.AI TIER_1 English(EN) · Haowei Guo, Baolong Bi, Ruicheng Zhang, Bingqian Sun, Wentao Zhang ·

    When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation

    arXiv:2606.03532v1 Announce Type: cross Abstract: Self on-policy distillation trains a student policy against a teacher derived from its own parameter history, yet the teacher's update schedule -- which governs the temporal coupling between teacher and student -- has not b…

  18. arXiv cs.LG TIER_1 English(EN) · Wentao Zhang ·

    When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation

    Self on-policy distillation trains a student policy against a teacher derived from its own parameter history, yet the teacher's update schedule -- which governs the temporal coupling between teacher and student -- has not been systematically studied as a stability variable…

  19. arXiv cs.AI TIER_1 English(EN) · Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, Xunliang Cai ·

    SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

    arXiv:2604.10688v2 Announce Type: replace-cross Abstract: On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy …

  20. arXiv cs.AI TIER_1 English(EN) · Yuxuan Jiang, Runchao Li, Shubhashis Roy Dipta, Dawei Li, Zhao Yang ·

    Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

    arXiv:2605.09253v2 Announce Type: replace-cross Abstract: While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Dis…

  21. arXiv cs.AI TIER_1 English(EN) · Yuxiao Yang, Xiaoyun Wang, Weitong Zhang ·

    OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

    arXiv:2605.12400v2 Announce Type: replace-cross Abstract: We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite its promise, OPSD can suffer f…

  22. arXiv cs.AI TIER_1 English(EN) · Weichen Yu, Xiaomin Li, Yizhou Zhao, Xiaoze Liu, Ruowang Zhang, Haixin Wang, Yinyi Luo, Chen Henry Wu, Gaurav Mittal, Matt Fredrikson, Yu Hu ·

    Multi-Rollout On-Policy Distillation via Peer Successes and Failures

    arXiv:2605.12652v2 Announce Type: replace-cross Abstract: Large language models are often post-trained with sparse verifier rewards, which indicate whether a sampled trajectory succeeds but provide limited guidance about where reasoning succeeds or fails. On-policy distillation (…

  23. arXiv cs.CL TIER_1 English(EN) · Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li, Yehui Tang ·

    Trust Region On-Policy Distillation

    arXiv:2606.01249v1 Announce Type: cross Abstract: On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training b…

  24. arXiv cs.CL TIER_1 English(EN) · Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Peng Bo, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao ·

    OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

    arXiv:2606.01476v1 Announce Type: cross Abstract: On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and…

  25. arXiv cs.AI TIER_1 English(EN) · Hao Li, Jingkun An, Zijun Song, Pengyu Zhu, Rui Li, Hao Wang, Wendi Feng, Yesheng Liu, Lijun Li, Jin-Ge Yao, Lei Sha ·

    SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

    arXiv:2606.02530v1 Announce Type: new Abstract: Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing methods mitigate this by balancing dual objectives, which heavily rely on massive general-purpose …

  26. arXiv cs.AI TIER_1 English(EN) · Hanyang Zhao, Haoxian Chen, Han Lin, Genta Indra Winata, David Yao, Wenpin Tang ·

    OPD+: Rethinking the Advantage Design for On-Policy Distillation

    arXiv:2606.01039v1 Announce Type: cross Abstract: On-policy distillation (OPD) is a widely used technique to transfer capabilities from capable teacher language models to the base student models, and can be formulated in a reinforcement learning style objective using student gene…

  27. arXiv cs.AI TIER_1 English(EN) · Yuxuan Jiang, Francis Ferraro ·

    Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

    arXiv:2606.00305v1 Announce Type: cross Abstract: On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own policy under teacher supervision. Although OPD operates on trajectories, its learning signal rem…

  28. arXiv cs.AI TIER_1 English(EN) · Lei Sha ·

    SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

    Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing methods mitigate this by balancing dual objectives, which heavily rely on massive general-purpose data or auxiliary reward models. In this paper, …

  29. arXiv cs.AI TIER_1 English(EN) · Daniil Plyusov, Alexey Gorbatovski, Alexey Malakhov, Nikita Balagansky, Boris Shaposhnikov, Daria Korotyshova, Daniil Gavrilov ·

    Trust-Region Behavior Blending for On-Policy Distillation

    arXiv:2605.31159v1 Announce Type: cross Abstract: On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, pla…

  30. arXiv cs.AI TIER_1 English(EN) · Yanjiang Liu, Jie Lou, Xinyan Guan, Yuqiu Ji, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Xing Yu, Yaojie Lu ·

    Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation

    arXiv:2605.30833v1 Announce Type: cross Abstract: On-policy distillation transfers reasoning capabilities by training a student model on its own generated trajectories using token-level feedback from a teacher. However, we identify a critical bottleneck, Supervision Fidel…

  31. Hugging Face Daily Papers TIER_1 English(EN) ·

    Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

    FiRe-OPD improves on-policy distillation in large language models by filtering low-quality trajectories and applying soft reweighting to enhance informative token selection and optimization stability.

  32. Hugging Face Daily Papers TIER_1 English(EN) ·

    OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

    OmniOPD addresses limitations of standard On-Policy Distillation by using chunk-level semantic similarity instead of token-level logits, improving learning reliability and performance with black-box teachers.

  33. Hugging Face Daily Papers TIER_1 English(EN) ·

    Trust Region On-Policy Distillation

    Trust Region On-Policy Distillation (TrOPD) improves reliable token-level supervision in large language model distillation by using trust regions, outlier estimation, and off-policy guidance to address instability issues under distribution mismatch.

  34. arXiv cs.AI TIER_1 English(EN) · Daniil Gavrilov ·

    Trust-Region Behavior Blending for On-Policy Distillation

    On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality pr…

  35. arXiv cs.AI TIER_1 English(EN) · Tommy He, Jerome Sieber, Matteo Saponati ·

    A Predictive Law for On-Policy Self-Distillation From World Feedback

    arXiv:2605.30070v1 Announce Type: cross Abstract: Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. On-policy self-distillation (OPSD) is a promising recent approach that uses arbitrary feedback as learning signa…

  36. arXiv cs.CL TIER_1 English(EN) · Haodi Lei, Yafy Li, Haoran Zhang, Shunkai Zhang, Qianjia Cheng, Xiaoye Qu, Ganqu Cui, Bowen Zhou, Ning Ding, Yun Luo, Yu Cheng ·

    Draft-OPD: On-Policy Distillation for Speculative Draft Models

    arXiv:2605.29343v1 Announce Type: new Abstract: Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, like EAGLE3 or DFlash, is su…

  37. Hugging Face Daily Papers TIER_1 English(EN) ·

    Trust-Region Behavior Blending for On-Policy Distillation

    Trust-Region Behavior Blending improves on-policy distillation by replacing early poor-quality student rollouts with teacher-like behavior within a KL trust region during warmup.

  38. arXiv cs.AI TIER_1 English(EN) · Matteo Saponati ·

    A Predictive Law for On-Policy Self-Distillation From World Feedback

    Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. On-policy self-distillation (OPSD) is a promising recent approach that uses arbitrary feedback as learning signal, yet its reliability compared to established met…

  39. arXiv cs.AI TIER_1 English(EN) · Kun Liang, Chenming Tang, Clive Bai, Weijie Liu, Saiyong Yang, Yunfang Wu ·

    ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

    arXiv:2605.28396v1 Announce Type: cross Abstract: On-policy distillation (OPD) transfers reasoning behavior by training a student on teacher feedback along student-generated trajectories, but standard full-rollout training ties every update to a costly completion and can over-all…

  40. arXiv cs.LG TIER_1 English(EN) · Yunfang Wu ·

    ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

    On-policy distillation (OPD) transfers reasoning behavior by training a student on teacher feedback along student-generated trajectories, but standard full-rollout training ties every update to a costly completion and can over-allocate supervision to late positions with low margi…

  41. arXiv cs.AI TIER_1 English(EN) · Zhou Ziheng, Jiaqi Li, Huacong Tang, Ying Nian Wu, Demetri Terzopoulos ·

    Less is More: Early Stopping Rollout for On-Policy Distillation

    arXiv:2605.27028v1 Announce Type: cross Abstract: On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollouts with a teacher model. However, we observe an "Off-policy Teacher Decay" pro…

  42. arXiv cs.AI TIER_1 English(EN) · Demetri Terzopoulos ·

    Less is More: Early Stopping Rollout for On-Policy Distillation

    On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollouts with a teacher model. However, we observe an "Off-policy Teacher Decay" problem in this paradigm: for the later tokens, with …

  43. Hugging Face Daily Papers TIER_1 English(EN) ·

    Less is More: Early Stopping Rollout for On-Policy Distillation

    On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollouts with a teacher model. However, we observe an "Off-policy Teacher Decay" problem in this paradigm: for the later tokens, with …

  44. arXiv cs.AI TIER_1 English(EN) · Siqi Zhu, Xuyan Ye, Hongyu Lu, Weiye Shi, Ge Liu ·

    The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

    arXiv:2605.11182v2 Announce Type: replace Abstract: On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model's own p…

  45. Hugging Face Daily Papers TIER_1 English(EN) ·

    Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

    Token-level teacher signals in on-policy distillation are better predicted by teachability—measuring local compatibility between teacher and student distributions—than by raw KL disagreement alone.

  46. Hugging Face Daily Papers TIER_1 English(EN) ·

    Less is More: Early Stopping Rollout for On-Policy Distillation

    On-policy distillation suffers from teacher decay issues with later tokens, which are mitigated by Early Stopping Rollout that restricts training to initial response tokens, improving efficiency and stability.

  47. arXiv cs.CV TIER_1 English(EN) · Yi Wang ·

    ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

    On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as referenc…

  48. r/MachineLearning TIER_1 English(EN) · /u/NielsRogge ·

    On-policy distillation: one of the hottest terms on PapersWithCode [R]

    https://www.reddit.com/r/MachineLearning/comments/1twmhud/onpolicy_distillation_one_of_the_hottest_terms_on/
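
Several of the abstracts above describe the same core mechanism: the student samples its own rollout, the teacher scores every position, and the student is pulled toward the teacher with a per-token (reverse) KL signal that is sometimes estimated from the sampled token alone. The sketch below illustrates that generic signal on toy logits. It is not the method of any listed paper; the function names (`reverse_kl`, `sampled_token_estimate`, `opd_loss`) and the choice of a plain mean over positions are assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """Exact KL(student || teacher) at one position, summed over the vocabulary."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


def sampled_token_estimate(student_logits, teacher_logits, token_id):
    """Single-sample estimate of the same KL: log p_student(y) - log p_teacher(y),
    where y (token_id) is assumed to have been sampled from the student."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return log_s[token_id] - log_t[token_id]


def opd_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-token reverse KL along one student-generated rollout.
    Both arguments are lists of per-position logit lists of equal length."""
    per_token = [
        reverse_kl(s, t) for s, t in zip(student_logits_seq, teacher_logits_seq)
    ]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    # Toy rollout: 2 positions, vocabulary of 3 tokens (hypothetical values).
    student = [[1.0, 0.5, -1.0], [0.0, 0.0, 2.0]]
    teacher = [[2.0, 0.0, -1.0], [0.0, 0.0, 2.0]]
    print("mean reverse KL:", opd_loss(student, teacher))
```

The single-sample estimate is unbiased for the exact per-position KL when the token really is drawn from the student, but its variance grows with vocabulary size, which is the limitation the OPRD abstract points to.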