Brief · PulseAugur

RESEARCH · arXiv cs.CL English(EN) · 2w · [48 sources]

Trust Region On-Policy Distillation

Researchers are exploring advanced techniques in on-policy distillation (OPD) for large language models to improve training stability and efficiency. Several papers introduce methods to refine how teacher models guide student models, focusing on selective learning, adaptive weighting, and better credit assignment. These approaches aim to overcome challenges like state-oblivious collapse, unreliable supervision signals, and the optimization of AI

On-Policy Distillation
Alexey Gorbatovski
ADWIN
Lookahead Group Reward
Trust-Region Behavior Blending
Early Stopping Rollout
Supervision Fidelity Decay