Researchers have identified significant empirical failure modes in on-policy distillation (OPD), a technique used in post-training large language models. The standard implementation relies on sampled-token log-ratios, which can produce unstable learning signals, especially on long sequences whose prefixes drift away from the teacher model's typical outputs. To address this, the paper proposes a new objective, teacher top-K local support matching, which improves optimization stability and yields a practical recipe for more robust on-policy distillation.
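The contrast between the two objectives can be sketched as follows. This is a hypothetical illustration under stated assumptions, not the paper's implementation: `sampled_logratio_signal` shows the standard per-token log-ratio signal, which blows up when the teacher assigns near-zero probability to a sampled token, while `topk_support_matching_loss` shows one plausible reading of top-K support matching, a cross-entropy against the teacher's renormalized top-K distribution at each position. The function names, the choice of K, and the renormalization step are all assumptions.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sampled_logratio_signal(student_logits, teacher_logits, sampled_ids):
    """Standard OPD per-token signal: log p_student - log p_teacher at the
    single sampled token. High variance when the sampled prefix drifts off
    the teacher's support (teacher probability -> 0, the ratio explodes)."""
    s_log = np.log(softmax(student_logits))
    t_log = np.log(softmax(teacher_logits))
    idx = np.arange(len(sampled_ids))
    return s_log[idx, sampled_ids] - t_log[idx, sampled_ids]

def topk_support_matching_loss(student_logits, teacher_logits, k=4):
    """Hypothetical sketch of teacher top-K local support matching: at each
    position, match the student to the teacher's renormalized top-K
    distribution (cross-entropy restricted to the teacher's local support)."""
    t = softmax(teacher_logits)
    s_log = np.log(softmax(student_logits))
    topk = np.argsort(t, axis=-1)[:, -k:]        # teacher's top-K token ids
    rows = np.arange(t.shape[0])[:, None]
    t_k = t[rows, topk]
    t_k = t_k / t_k.sum(axis=-1, keepdims=True)  # renormalize over the support
    return -(t_k * s_log[rows, topk]).sum(axis=-1).mean()
```

Restricting the match to the teacher's top-K support bounds the per-token signal regardless of how unlikely the sampled prefix is under the teacher, which is one plausible mechanism for the stability gain the summary describes.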
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Improves optimization stability and downstream performance of on-policy distillation, offering a practical recipe for more robust LLM post-training.
RANK_REASON The cluster contains an academic paper detailing empirical failure modes and proposed fixes for a specific LLM training technique.