PulseAugur

Researchers refine on-policy distillation for more stable LLM training

Researchers have identified significant empirical failure modes in on-policy distillation (OPD), a technique used in post-training large language models. The standard implementation relies on sampled-token log-ratios, which can produce unstable learning signals, especially on long sequences whose prefixes drift away from the teacher model's typical output. To address this, the paper proposes a new objective, teacher top-K local support matching, which improves optimization stability and gives a practical route to more robust on-policy distillation.
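
To make the failure mode concrete, here is a minimal PyTorch sketch of the sampled-token log-ratio signal the paper critiques. The function name and tensor shapes are illustrative, not taken from the source; the point is that on a prefix the teacher assigns low probability, the teacher's token log-probability becomes very negative and the per-token ratio and its gradient blow up.

```python
import torch
import torch.nn.functional as F

def sampled_token_opd_signal(student_logits: torch.Tensor,
                             teacher_logits: torch.Tensor,
                             tokens: torch.Tensor) -> torch.Tensor:
    """Per-token log-ratio on a student rollout (the critiqued reduction).

    student_logits, teacher_logits: [T, V] logits at each rollout position.
    tokens: [T] token ids actually sampled by the student.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    idx = tokens.unsqueeze(-1)
    s_tok = s_logp.gather(-1, idx).squeeze(-1)  # log p_student(y_t | prefix)
    t_tok = t_logp.gather(-1, idx).squeeze(-1)  # log p_teacher(y_t | prefix)
    # Single-sample reverse-KL estimate. Once a long rollout's prefix drifts
    # off the teacher's support, t_tok becomes very negative, so the ratio
    # (and hence the gradient) can explode: the instability described above.
    return (s_tok - t_tok).mean()
```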

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT Improves optimization stability and performance for on-policy distillation, offering a practical recipe for more robust LLM post-training.

RANK_REASON The cluster contains an academic paper detailing empirical failure modes and proposed fixes for a specific LLM training technique.

Read on arXiv cs.CL →

COVERAGE [1]

  1. arXiv cs.CL TIER_1 · Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Jiacai Liu, Zhuo Jiang, Yuanheng Zhu, Dongbin Zhao

    Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

    arXiv:2603.25562v2 · Announce Type: replace-cross · Abstract: On-policy distillation (OPD) is increasingly used in LLM post-training because it can leverage a teacher model to provide dense supervision on student rollouts. The standard implementation, however, usually reduces distrib…
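
The truncated abstract does not spell out the proposed objective, so the following is only one plausible reading of "teacher top-K local support matching": restrict both distributions to the teacher's top-K tokens at each position, renormalize over that local support, and match there, which keeps the signal bounded even when the sampled token falls outside the teacher's support. The function name, the choice of K, and the use of a forward KL on the restricted support are all assumptions, not details from the source.

```python
import torch
import torch.nn.functional as F

def topk_local_support_matching(student_logits: torch.Tensor,
                                teacher_logits: torch.Tensor,
                                k: int = 20) -> torch.Tensor:
    """Hypothetical sketch of teacher top-K local support matching.

    Both distributions are restricted to the teacher's top-K tokens at each
    position and renormalized there, so the loss stays bounded even when the
    rollout leaves the teacher's high-probability region.
    """
    t_logp = F.log_softmax(teacher_logits, dim=-1)   # [T, V]
    s_logp = F.log_softmax(student_logits, dim=-1)   # [T, V]
    top_t, top_idx = t_logp.topk(k, dim=-1)          # teacher's local support
    s_top = s_logp.gather(-1, top_idx)               # student on that support
    t_local = F.log_softmax(top_t, dim=-1)           # renormalize teacher
    s_local = F.log_softmax(s_top, dim=-1)           # renormalize student
    # Forward KL over the restricted support (an assumption, not confirmed
    # by the truncated abstract): dense supervision from K teacher tokens.
    return (t_local.exp() * (t_local - s_local)).sum(dim=-1).mean()
```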