Researchers have identified a position bias in On-Policy Distillation (OPD), a method used to improve reinforcement learning efficiency. OPD's standard KL objective weights all tokens uniformly, yet supervision quality degrades for later tokens in longer rollouts. As a result, training on the full rollout performs comparably to training on only the first 30% of tokens, while the last 30% of tokens yield minimal learning. To address this, the team developed Importance-Weighted On-Policy Distillation (IW-OPD), which assigns token weights based on the accumulated distribution discrepancy between the student and teacher models, effectively upweighting earlier tokens. IW-OPD converges faster, learns more efficiently, and reaches better final performance, with gains of up to 6.9 points on the AIME-2025 benchmark.
AI IMPACT Improves reinforcement learning efficiency and performance by addressing token position bias in distillation methods.
RANK_REASON Academic paper detailing a new method for reinforcement learning.
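The weighting idea in the summary can be illustrated with a small sketch. The summary does not give the paper's actual formula, so everything here is an assumption made for illustration: the function names, the exponential decay form, and the `beta` parameter are hypothetical. The sketch computes a per-token KL divergence between student and teacher distributions, accumulates it along the rollout, and gives each token a weight that shrinks as the discrepancy accumulated before it grows.

```python
import math


def token_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at a single token position."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )


def importance_weights(per_token_kl, beta=1.0):
    """Hypothetical weighting rule (not taken from the paper).

    Each token's weight decays exponentially with the discrepancy
    accumulated over the tokens before it, so earlier tokens get
    larger weights. Weights are rescaled to average 1.
    """
    if not per_token_kl:
        return []
    weights = []
    accumulated = 0.0
    for kl in per_token_kl:
        weights.append(math.exp(-beta * accumulated))
        accumulated += kl
    scale = len(weights) / sum(weights)
    return [w * scale for w in weights]


def weighted_distillation_loss(student, teacher, beta=1.0):
    """Mean of per-token KL terms, each scaled by its importance weight."""
    kls = [token_kl(s, t) for s, t in zip(student, teacher)]
    if not kls:
        return 0.0
    weights = importance_weights(kls, beta)
    return sum(w * k for w, k in zip(weights, kls)) / len(kls)
```

In this sketch, setting `beta=0` makes every weight equal to 1, which corresponds to the uniform token weighting the summary attributes to standard OPD. Larger `beta` values shift more of the loss toward early tokens.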
- AIME 2025
- alphaXiv
- arXiv
- CatalyzeX
- DagsHub
- Gotit.pub
- Hugging Face
- Importance-Weighted On-Policy Distillation
- On-Policy Distillation
- ScienceCast
AI-generated summary · Google Gemini · from 1 source.