PulseAugur

Trust Region On-Policy Distillation

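Researchers are exploring advanced techniques for on-policy distillation (OPD) of large language models to improve training stability and efficiency. Several papers introduce ways to improve how a teacher model guides a student model, focusing on selective learning, adaptive weighting, and better credit assignment. These methods aim to overcome challenges such as state-agnostic collapse, unreliable supervision signals, and optimization difficulties.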
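As a rough illustration of the token-level signal these papers build on, the sketch below computes a per-token reverse KL between student and teacher next-token distributions along one student-sampled rollout, with optional per-token weights standing in for the selective and adaptive weighting schemes mentioned above. It is a minimal sketch in plain Python, not any paper's method: the function names (`log_softmax`, `reverse_kl`, `opd_loss`), the logit lists, and the weighting interface are all assumptions made for illustration.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one vocabulary-sized list of logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def reverse_kl(student_logits, teacher_logits):
    # KL(student || teacher) for a single next-token distribution.
    lp = log_softmax(student_logits)
    lq = log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def opd_loss(student_steps, teacher_steps, weights=None):
    # Weighted mean of per-token reverse KL along one rollout sampled
    # from the student. `weights` is a hypothetical hook: 0/1 values act
    # as a token filter, fractional values as soft reweighting.
    if weights is None:
        weights = [1.0] * len(student_steps)
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    kls = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(w * k for w, k in zip(weights, kls)) / total_weight

# Toy rollout with two positions and a two-token vocabulary (made-up logits).
student_steps = [[0.0, 0.0], [2.0, 0.0]]
teacher_steps = [[0.0, 0.0], [0.0, 2.0]]
print(opd_loss(student_steps, teacher_steps))
print(opd_loss(student_steps, teacher_steps, weights=[0.0, 1.0]))
```

Setting a weight to zero drops that position from the loss, which is the simplest form of token selection; a real training loop would compute the same quantity on model logits with automatic differentiation.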


AI-generated summary · Google Gemini · from 48 sources.


Sources [48]

  1. arXiv cs.LG TIER_1 English(EN) · Yifan Niu, Han Xiao, Dongyi Liu, Zelong Wang, Dihong Gong, Yasheng Wang, Jia Li ·

    Breaking the Tokenizer Barrier: On-Policy Distillation Across Model Families

    arXiv:2606.09456v1 Announce Type: new Abstract: On-Policy Distillation (OPD) has become a core technique in the post-training of Large Language Models (LLMs) for transferring knowledge from domain experts to student models. However, existing OPD distillation methods require teach…

  2. arXiv cs.LG TIER_1 English(EN) · Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Xiaofeng Zhang, Xiaosong Yuan ·

    SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Staged Teacher Sampling

    arXiv:2606.09304v1 Announce Type: cross Abstract: On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find th…

  3. arXiv cs.LG TIER_1 English(EN) · Haoran Xin, Anhao Zhao, Ying Sun, Jin Li, Xiaoyu Shen, Hui Xiong ·

    Escaping the KL Agreement Trap in On-Policy Distillation

    arXiv:2606.09471v1 Announce Type: new Abstract: On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degra…

  4. arXiv cs.LG TIER_1 English(EN) · Dongze Hao, Zhiwei Jin, Chen Chen, Haonan Lu ·

    Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization

    arXiv:2606.09091v1 Announce Type: new Abstract: On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinf…

  5. arXiv cs.CL TIER_1 English(EN) · Hui Xiong ·

    Escaping the KL Agreement Trap in On-Policy Distillation

    On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degraded state, producing low reverse KL but little c…

  6. arXiv cs.LG TIER_1 English(EN) · Jia Li ·

    Breaking the Tokenizer Barrier: On-Policy Distillation Across Model Families

    On-Policy Distillation (OPD) has become a core technique in the post-training of Large Language Models (LLMs) for transferring knowledge from domain experts to student models. However, existing OPD distillation methods require teacher and student models to share the same tokenize…

  7. arXiv cs.CL TIER_1 English(EN) · Xiaosong Yuan ·

    SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Staged Teacher Sampling

    On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assu…

  8. arXiv cs.AI TIER_1 English(EN) · Shizhe Xiang, Ke An, Wenlong Yu, Yue Liu, Jian Luan, Pei Fu, Qilong Wang ·

    Teach the Method, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

    arXiv:2606.07000v1 Announce Type: new Abstract: Recent post-training methods, particularly Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced the reasoning ability of Large Vision-Language Models (LVLMs). However, the sparse nature of verifiable re…

  9. arXiv cs.AI TIER_1 English(EN) · Zhennan Shen, Yanshu Li, Qingyu Yin, Chak Tou Leong, Zhilin Wang, Yanxu Chen, Rongduo Han, Sunbowen Lee, Yi R. Fung ·

    On the Geometry of On-Policy Distillation

    arXiv:2606.07082v1 Announce Type: cross Abstract: On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with …

  10. arXiv cs.LG TIER_1 English(EN) · Yi R. Fung ·

    On the Geometry of On-Policy Distillation

    On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement lea…

  11. arXiv cs.AI TIER_1 English(EN) · Qilong Wang ·

    Teach the Method, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

    Recent post-training methods, particularly Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced the reasoning ability of Large Vision-Language Models (LVLMs). However, the sparse nature of verifiable rewards provides little token-level supervision fo…

  12. arXiv cs.LG TIER_1 English(EN) · Shenzhi Yang, Guangcheng Zhu, Bowen Song, Haobo Wang, Mingxuan Xia, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen ·

    OPRD: On-Policy Representation Distillation

    arXiv:2606.06021v1 Announce Type: new Abstract: On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.…

  13. arXiv cs.LG TIER_1 English(EN) · Kanghui Tian, Siyuan Liu, Ziang Yan, Sheng Xia, Shuai Dong, Yi Wang ·

    ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

    arXiv:2606.05718v1 Announce Type: cross Abstract: On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that ob…

  14. Hugging Face Daily Papers TIER_1 English(EN) ·

    On the Geometry of On-Policy Distillation

    On-policy distillation exhibits distinct parameter space dynamics characterized by relaxed off-principal updates and subspace locking, forming a unique geometric pattern separate from supervised fine-tuning and reinforcement learning with verifiable rewards.

  15. Hugging Face Daily Papers TIER_1 English(EN) ·

    OPRD: On-Policy Representation Distillation

    On-Policy Representation Distillation (OPRD) improves upon traditional on-policy distillation by aligning student and teacher representations in hidden-state space rather than just output space, resulting in reduced variance and improved training efficiency.

  16. arXiv cs.AI TIER_1 English(EN) · Yuying Li, Leqi Zheng, Yongzi Yu, Wenrui Zhou, Xuchang Zhong, Xing Hu, Jing Jin, Huangjie Yuan, Tao Feng ·

    Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

    arXiv:2606.02684v1 Announce Type: cross Abstract: On-Policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which …

  17. arXiv cs.AI TIER_1 English(EN) · Haowei Guo, Baolong Bi, Ruicheng Zhang, Bingqian Sun, Wentao Zhang ·

    When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation

    arXiv:2606.03532v1 Announce Type: cross Abstract: Self on-policy distillation trains a student policy against a teacher derived from its own parameter history, yet the teacher's update schedule -- which governs the temporal coupling between teacher and student -- has not b…

  18. arXiv cs.LG TIER_1 English(EN) · Wentao Zhang ·

    When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation

    Self on-policy distillation trains a student policy against a teacher derived from its own parameter history, yet the teacher's update schedule -- which governs the temporal coupling between teacher and student -- has not been systematically studied as a stability variable…

  19. arXiv cs.AI TIER_1 English(EN) · Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, Xunliang Cai ·

    SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

    arXiv:2604.10688v2 Announce Type: replace-cross Abstract: On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy …

  20. arXiv cs.AI TIER_1 English(EN) · Yuxuan Jiang, Runchao Li, Shubhashis Roy Dipta, Dawei Li, Zhao Yang ·

    Cornerstone or Stumbling Block? Interpreting Rock Tokens in On-Policy Distillation

    arXiv:2605.09253v2 Announce Type: replace-cross Abstract: While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Dis…

  21. arXiv cs.AI TIER_1 English(EN) · Yuxiao Yang, Xiaoyun Wang, Weitong Zhang ·

    OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

    arXiv:2605.12400v2 Announce Type: replace-cross Abstract: We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite its promise, OPSD can suffer f…

  22. arXiv cs.AI TIER_1 English(EN) · Weichen Yu, Xiaomin Li, Yizhou Zhao, Xiaoze Liu, Ruowang Zhang, Haixin Wang, Yinyi Luo, Chen Henry Wu, Gaurav Mittal, Matt Fredrikson, Yu Hu ·

    Multi-Round On-Policy Distillation from Peers' Successes and Failures

    arXiv:2605.12652v2 Announce Type: replace-cross Abstract: Large language models are often post-trained with sparse verifier rewards, which indicate whether a sampled trajectory succeeds but provide limited guidance about where reasoning succeeds or fails. On-policy distillation (…

  23. arXiv cs.CL TIER_1 English(EN) · Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li, Yehui Tang ·

    Trust Region On-Policy Distillation

    arXiv:2606.01249v1 Announce Type: cross Abstract: On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training b…

  24. arXiv cs.CL TIER_1 English(EN) · Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Peng Bo, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao ·

    OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

    arXiv:2606.01476v1 Announce Type: cross Abstract: On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and…

  25. arXiv cs.AI TIER_1 English(EN) · Hao Li, Jingkun An, Zijun Song, Pengyu Zhu, Rui Li, Hao Wang, Wendi Feng, Yesheng Liu, Lijun Li, Jin-Ge Yao, Lei Sha ·

    SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

    arXiv:2606.02530v1 Announce Type: new Abstract: Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing methods mitigate this by balancing dual objectives, which heavily rely on massive general-purpose …

  26. arXiv cs.AI TIER_1 English(EN) · Hanyang Zhao, Haoxian Chen, Han Lin, Genta Indra Winata, David Yao, Wenpin Tang ·

    OPD+: Rethinking Advantage Design for On-Policy Distillation

    arXiv:2606.01039v1 Announce Type: cross Abstract: On-policy distillation (OPD) is a widely used technique to transfer capabilities from capable teacher language models to the base student models, and can be formulated in a reinforcement learning style objective using student gene…

  27. arXiv cs.AI TIER_1 English(EN) · Yuxuan Jiang, Francis Ferraro ·

    Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

    arXiv:2606.00305v1 Announce Type: cross Abstract: On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own policy under teacher supervision. Although OPD operates on trajectories, its learning signal rem…

  28. arXiv cs.AI TIER_1 English(EN) · Lei Sha ·

    SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

    Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing methods mitigate this by balancing dual objectives, which heavily rely on massive general-purpose data or auxiliary reward models. In this paper, …

  29. arXiv cs.AI TIER_1 English(EN) · Daniil Plyusov, Alexey Gorbatovski, Alexey Malakhov, Nikita Balagansky, Boris Shaposhnikov, Daria Korotyshova, Daniil Gavrilov ·

    Trust-Region Behavior Blending for On-Policy Distillation

    arXiv:2605.31159v1 Announce Type: cross Abstract: On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, pla…

  30. arXiv cs.AI TIER_1 English(EN) · Yanjiang Liu, Jie Lou, Xinyan Guan, Yuqiu Ji, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Xing Yu, Yaojie Lu ·

    Your Teacher Cannot Help Here: Combating Supervision Fidelity Decay in On-Policy Distillation

    arXiv:2605.30833v1 Announce Type: cross Abstract: On-policy distillation transfers reasoning capabilities by training a student model on its own generated trajectories using token-level feedback from a teacher. However, we identify a critical bottleneck, Supervision Fidel…

  31. Hugging Face Daily Papers TIER_1 English(EN) ·

    Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

    FiRe-OPD improves on-policy distillation in large language models by filtering low-quality trajectories and applying soft reweighting to enhance informative token selection and optimization stability.

  32. Hugging Face Daily Papers TIER_1 English(EN) ·

    OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

    OmniOPD addresses limitations of standard On-Policy Distillation by using chunk-level semantic similarity instead of token-level logits, improving learning reliability and performance with black-box teachers.

  33. Hugging Face Daily Papers TIER_1 English(EN) ·

    Trust Region On-Policy Distillation

    Trust Region On-Policy Distillation (TrOPD) improves reliable token-level supervision in large language model distillation by using trust regions, outlier estimation, and off-policy guidance to address instability issues under distribution mismatch.

  34. arXiv cs.AI TIER_1 English(EN) · Daniil Gavrilov ·

    Trust-Region Behavior Blending for On-Policy Distillation

    On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality pr…

  35. arXiv cs.AI TIER_1 English(EN) · Tommy He, Jerome Sieber, Matteo Saponati ·

    Predictive Laws for On-Policy Self-Distillation from World Feedback

    arXiv:2605.30070v1 Announce Type: cross Abstract: Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. On-policy self-distillation (OPSD) is a promising recent approach that uses arbitrary feedback as learning signa…

  36. arXiv cs.CL TIER_1 English(EN) · Haodi Lei, Yafy Li, Haoran Zhang, Shunkai Zhang, Qianjia Cheng, Xiaoye Qu, Ganqu Cui, Bowen Zhou, Ning Ding, Yun Luo, Yu Cheng ·

    Draft-OPD: On-Policy Distillation for Speculative Draft Models

    arXiv:2605.29343v1 Announce Type: new Abstract: Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, like EAGLE3 or DFlash is su…

  37. Hugging Face Daily Papers TIER_1 English(EN) ·

    Trust-Region Behavior Blending for On-Policy Distillation

    Trust-Region behavior Blending improves on-policy distillation by replacing early poor-quality student rollouts with teacher-like behavior within a KL trust region during warmup.

  38. arXiv cs.AI TIER_1 English(EN) · Matteo Saponati ·

    Predictive Laws for On-Policy Self-Distillation from World Feedback

    Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. On-policy self-distillation (OPSD) is a promising recent approach that uses arbitrary feedback as learning signal, yet its reliability compared to established met…

  39. arXiv cs.AI TIER_1 English(EN) · Kun Liang, Chenming Tang, Clive Bai, Weijie Liu, Saiyong Yang, Yunfang Wu ·

    ADWIN: Adaptive Windows for Aware On-Policy Distillation

    arXiv:2605.28396v1 Announce Type: cross Abstract: On-policy distillation (OPD) transfers reasoning behavior by training a student on teacher feedback along student-generated trajectories, but standard full-rollout training ties every update to a costly completion and can over-all…

  40. arXiv cs.LG TIER_1 English(EN) · Yunfang Wu ·

    ADWIN: Adaptive Windows for Aware On-Policy Distillation

    On-policy distillation (OPD) transfers reasoning behavior by training a student on teacher feedback along student-generated trajectories, but standard full-rollout training ties every update to a costly completion and can over-allocate supervision to late positions with low margi…

  41. arXiv cs.AI TIER_1 English(EN) · Zhou Ziheng, Jiaqi Li, Huacong Tang, Ying Nian Wu, Demetri Terzopoulos ·

    Less Is More: Early Stopping Rollout for On-Policy Distillation

    arXiv:2605.27028v1 Announce Type: cross Abstract: On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollouts with a teacher model. However, we observe "Off-policy Teacher Decay" pro…

  42. arXiv cs.AI TIER_1 English(EN) · Demetri Terzopoulos ·

    Less Is More: Early Stopping Rollout for On-Policy Distillation

    On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollouts with a teacher model. However, we observe "Off-policy Teacher Decay" problem in this paradigm: for the later tokens, with …

  43. Hugging Face Daily Papers TIER_1 English(EN) ·

    Less Is More: Early Stopping Rollout for On-Policy Distillation

    On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollouts with a teacher model. However, we observe "Off-policy Teacher Decay" problem in this paradigm: for the later tokens, with …

  44. arXiv cs.AI TIER_1 English(EN) · Siqi Zhu, Xuyan Ye, Hongyu Lu, Weiye Shi, Ge Liu ·

    The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

    arXiv:2605.11182v2 Announce Type: replace Abstract: On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model's own p…

  45. Hugging Face Daily Papers TIER_1 English(EN) ·

    Not All Disagreements Are Learnable: Token Teachability in On-Policy Distillation

    Token-level teacher signals in on-policy distillation are better predicted by teachability—measuring local compatibility between teacher and student distributions—than by raw KL disagreement alone.

  46. Hugging Face Daily Papers TIER_1 English(EN) ·

    Less Is More: Early Stopping Rollout for On-Policy Distillation

    On-policy distillation suffers from teacher decay issues with later tokens, which are mitigated by Early Stopping Rollout that restricts training to initial response tokens, improving efficiency and stability.

  47. arXiv cs.CV TIER_1 English(EN) · Yi Wang ·

    ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

    On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as referenc…

  48. r/MachineLearning TIER_1 English(EN) · /u/NielsRogge ·

    On-policy distillation: one of the hottest terms on PapersWithCode [R]

    https://www.reddit.com/r/MachineLearning/comments/1twmhud/onpolicy_distillation_one_of_the_hottest_terms_on/