Trajectory-Refined Distillation
Researchers have introduced Trajectory-Refined Distillation (TRD), a new method to improve the post-training process for large language models. TRD addresses a problem called "prefix failure" in on-policy distillation, where dense per-token supervision leads to fragmented gradients. By correcting student model rollouts at the trajectory level before distillation, TRD mitigates this issue and enhances exploration. The method has demonstrated consistent performance improvements across various benchmarks and model scales. AI
IMPACT Enhances LLM reasoning and accuracy by refining distillation techniques.