Researchers have developed new methods to make on-policy distillation (OPD) more efficient when training AI models on long-horizon reasoning tasks. Standard OPD requires full rollouts, which are computationally expensive and can provide unreliable feedback early in training. The proposed techniques, Progressive OPD (POPD) and Truncated OPD (TOPD), instead optimize the rollout horizon. POPD gradually increases the rollout length over the course of training, while TOPD uses only a fraction of the full rollout horizon. Experiments show that POPD can improve training efficiency by up to threefold, and that TOPD achieves comparable performance with significantly less computation.
AI IMPACT Optimizes AI training for complex reasoning tasks, potentially reducing computational costs and accelerating development.
RANK_REASON This is a research paper detailing new methods for AI training.
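The difference between the two horizon strategies can be illustrated with a minimal sketch. The summary does not say how POPD grows the rollout length or which fraction TOPD uses, so the linear schedule, the function names, and every parameter below are assumptions made for illustration, not the researchers' implementation.

```python
def progressive_horizon(step, total_steps, min_horizon, max_horizon):
    """POPD-style horizon: grow the rollout length as training proceeds.

    Assumption: growth is linear in the training step. The source only
    says the length increases gradually.
    """
    if total_steps <= 1:
        return max_horizon
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return int(round(min_horizon + frac * (max_horizon - min_horizon)))


def truncated_horizon(max_horizon, fraction):
    """TOPD-style horizon: use a constant fraction of the full rollout.

    Assumption: the fraction is a fixed hyperparameter in (0, 1].
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Always keep at least one token so the rollout is never empty.
    return max(1, int(max_horizon * fraction))


if __name__ == "__main__":
    # Hypothetical settings: 101 training steps, rollouts of 256 to 4096 tokens.
    for step in (0, 50, 100):
        print(step, progressive_horizon(step, 101, 256, 4096))
    print("truncated:", truncated_horizon(4096, 0.25))
```

In a training loop, the returned value would cap the number of tokens generated per rollout before the teacher's feedback is computed. Under these assumptions, the progressive schedule spends little compute early, when feedback on long rollouts is least reliable, while the truncated variant keeps the per-step cost constant.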
AI-generated summary · Google Gemini · from 2 sources.