Researchers have developed new methods to improve the efficiency and stability of on-policy distillation (OPD) for large language models. One approach, vOPD, uses a control-variate baseline derived from the reverse KL divergence to reduce gradient variance without significant computational overhead. Another, ROPD, enables rubric-based distillation using only teacher-generated responses, offering a black-box-compatible alternative to logit-based OPD. A third technique, Near-Policy Distillation (NPD), accelerates training through asynchronous generation and selective packing, achieving substantial speedups while outperforming standard fine-tuning.
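To make the control-variate idea concrete, the sketch below shows how a baseline can be subtracted from a per-token reverse-KL signal in a REINFORCE-style OPD loss. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name opd_loss_with_baseline, the batch-mean baseline, and the toy tensors are all hypothetical; the source says only that vOPD derives its baseline from the reverse KL divergence.

```python
import torch

def opd_loss_with_baseline(student_logprobs: torch.Tensor,
                           teacher_logprobs: torch.Tensor) -> torch.Tensor:
    """Surrogate OPD loss with a variance-reducing baseline (illustrative).

    Both inputs are (batch, seq_len) log-probabilities of the tokens the
    student actually sampled, scored by each model.
    """
    # Per-token reverse-KL signal on the student's own samples:
    # log pi_student(y) - log pi_teacher(y). Treated as a fixed "reward",
    # so it is detached from the computation graph.
    kl_signal = (student_logprobs - teacher_logprobs).detach()

    # Control-variate baseline. A batch mean is an assumption made here
    # purely for illustration; vOPD derives its baseline from the reverse
    # KL itself, and the exact form may differ. Subtracting a
    # sample-independent baseline leaves the policy gradient unbiased,
    # since E[b * grad log pi] = b * grad(sum pi) = 0. (Strictly, a
    # leave-one-out mean would keep this estimate exactly unbiased.)
    baseline = kl_signal.mean()
    advantage = kl_signal - baseline

    # REINFORCE-style surrogate: gradients flow only through the
    # student's log-probabilities.
    return (advantage * student_logprobs).mean()

# Toy usage with random log-probs standing in for real model outputs.
student_lp = torch.randn(4, 16, requires_grad=True)
teacher_lp = torch.randn(4, 16)
loss = opd_loss_with_baseline(student_lp, teacher_lp)
loss.backward()
```

Because the baseline enters only through the detached advantage, it changes the variance of the gradient estimate without changing its expectation, which is why it adds essentially no computational overhead.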
Summary written by gemini-2.5-flash-lite from 5 sources.
IMPACT These advances make on-policy distillation of LLMs more efficient and stable, potentially accelerating their deployment for complex reasoning tasks.
RANK_REASON Multiple arXiv papers introduce novel methods for improving on-policy distillation of LLMs.