Researchers have developed new methods to improve the efficiency and stability of on-policy distillation (OPD) for large language models. One approach, vOPD, uses a control variate baseline derived from the reverse KL divergence to reduce gradient variance without significant computational overhead. Another method, ROPD, enables rubric-based distillation using only teacher-generated responses, offering a black-box compatible alternative to logit-based OPD. A third technique, Near-Policy Distillation (NPD), accelerates the process through asynchronous generation and selective packing, achieving substantial speedups and outperforming standard fine-tuning.
Impact: These advancements offer more efficient and stable methods for aligning LLMs, potentially accelerating their deployment in complex reasoning tasks.
Ranking rationale: Multiple arXiv papers introduce novel methods for improving on-policy distillation techniques in LLMs.
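To make the control-variate idea behind vOPD concrete, here is a minimal toy sketch in Python. It is not the papers' implementation: it assumes a single token position with a three-word vocabulary, made-up student and teacher probabilities, and a gradient taken with respect to the student's logits. The helper names (`reverse_kl`, `estimator_stats`) are hypothetical. The sketch enumerates the vocabulary to compute the exact mean and variance of a single-sample score-function estimator of the reverse KL gradient, once without a baseline and once with the reverse KL value itself subtracted as the baseline.

```python
import math


def reverse_kl(p, q):
    """Reverse KL divergence KL(p || q) for a student p and teacher q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def estimator_stats(p, q, baseline):
    """Exact mean and total variance of the single-sample estimator
    g(x) = (log p(x) - log q(x) - baseline) * grad_logits log p(x), x ~ p.

    For a softmax over logits, grad_logits log p(x) = onehot(x) - p.
    """
    n = len(p)
    mean = [0.0] * n
    second_moment = 0.0
    for x in range(n):
        signal = math.log(p[x] / q[x]) - baseline
        score = [(1.0 if j == x else 0.0) - p[j] for j in range(n)]
        g = [signal * s for s in score]
        for j in range(n):
            mean[j] += p[x] * g[j]
        second_moment += p[x] * sum(v * v for v in g)
    # Total variance = E[||g||^2] - ||E[g]||^2, summed over components.
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


# Toy distributions (assumed values, chosen only for illustration).
student = [0.4, 0.3, 0.3]
teacher = [0.8, 0.1, 0.1]

kl = reverse_kl(student, teacher)
mean_plain, var_plain = estimator_stats(student, teacher, baseline=0.0)
mean_base, var_base = estimator_stats(student, teacher, baseline=kl)

print("reverse KL:", kl)
print("variance without baseline:", var_plain)
print("variance with KL baseline:", var_base)
```

Because the expected score is zero, subtracting any constant baseline leaves the expected gradient unchanged, so both estimators target the same gradient. In this toy setup the baseline also lowers the variance, but that is not guaranteed for every pair of distributions, which is one reason the choice of baseline matters in practice.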
- Near-Policy Distillation
- openPangu-Embedded-1B
- Qwen3-1.7B
- Reinforcement Learning
- Supervised Fine-Tuning
- On-Policy Distillation
- ROPD
- vOPD
AI-generated summary · Google Gemini · from 5 sources.