Brief · PulseAugur

TOOL · arXiv cs.CL English (EN) · 10h

A Survey of On-Policy Distillation for Large Language Models

A new survey paper on arXiv covers On-Policy Distillation (OPD), a technique for transferring capabilities from large, costly language models to smaller, more deployable ones. Unlike conventional distillation, in which the student imitates fixed teacher-generated text, OPD reframes distillation as an iterative correction process: the student generates its own outputs and the teacher provides feedback on them. The aim is to mitigate exposure bias, the mismatch that arises when a student is trained on clean teacher prefixes but must continue from its own text at inference, which lets errors compound on longer, reasoning-intensive tasks. The survey organizes the field along key design axes and discusses its connections to reinforcement learning and imitation learning.
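To make the idea concrete, here is a minimal toy sketch of one on-policy distillation pass, not the survey's own method. It assumes a three-token vocabulary, a hypothetical `toy_teacher` and `toy_student`, and reverse KL as the divergence (one possible choice; the brief does not say which divergence the survey favors). The point is only that the teacher scores prefixes the student itself sampled.

```python
import math
import random

VOCAB = ["a", "b", "c"]  # hypothetical toy vocabulary


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher); an assumed choice of divergence."""
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0
    )


def toy_teacher(prefix):
    """Hypothetical teacher: prefers repeating the last token, else 'a'."""
    last = VOCAB.index(prefix[-1]) if prefix else 0
    return softmax([2.0 if i == last else 0.0 for i in range(len(VOCAB))])


def toy_student(prefix, logits):
    """Hypothetical student: ignores the prefix, uses one logit vector."""
    return softmax(logits)


def on_policy_loss(student_logits, length, rng):
    """Average per-token divergence along a sequence the student samples."""
    prefix, total = [], 0.0
    for _ in range(length):
        s = toy_student(prefix, student_logits)
        t = toy_teacher(prefix)  # teacher feedback on the student's own prefix
        total += reverse_kl(s, t)
        token = rng.choices(VOCAB, weights=s, k=1)[0]  # student picks next token
        prefix.append(token)
    return total / length, prefix


loss, sample = on_policy_loss([0.0, 0.0, 0.0], length=5, rng=random.Random(0))
print(sample, round(loss, 4))
```

In a real setup both models would be neural networks and the loss would be minimized by gradient descent on the student; the sketch only shows where the training sequences come from and where the teacher's feedback enters.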

IMPACT This technique could enable more efficient deployment of powerful LLM capabilities into smaller, cost-effective models.

arXiv
reinforcement learning from human feedback
reinforcement learning
large-language models
On-Policy Distillation
imitation learning
$f$-divergence
Mingyang Song