New distillation method tackles position bias in reinforcement learning

By PulseAugur Editorial · [1 source] · 2026-06-21 17:20

Researchers have identified a position bias in On-Policy Distillation (OPD), a method that improves the learning efficiency of reinforcement learning through dense, token-level supervision from a teacher model. OPD's standard KL objective weights all tokens uniformly, yet supervision quality degrades at later token positions in long rollouts. As a result, training on only the first 30% of tokens performs comparably to training on the full rollout, while the last 30% of tokens contribute minimal learning. To address this, the team developed Importance-Weighted On-Policy Distillation (IW-OPD), which assigns token weights based on the accumulated distribution discrepancy between the student and teacher models, effectively upweighting earlier tokens. The authors report that IW-OPD converges faster, learns more efficiently, and reaches better final performance, with an improvement of up to 6.9 points on the AIME-2025 benchmark.
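To make the contrast concrete, here is a minimal sketch of a uniformly averaged token-level KL loss next to a position-aware weighted version. The summary does not give the paper's exact weighting formula, so the exponential decay in `discrepancy_weights`, the `beta` parameter, the KL direction, and all function names are illustrative assumptions, not the authors' method.

```python
import math


def token_kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.

    Assumes every entry of q is strictly positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def uniform_loss(student, teacher):
    """Uniform average of per-token KL terms: every position gets weight 1/T."""
    kls = [token_kl(s, t) for s, t in zip(student, teacher)]
    return sum(kls) / len(kls)


def discrepancy_weights(student, teacher, beta=1.0):
    """Hypothetical weighting (an assumption, not the paper's formula).

    The weight at position t decays exponentially with the student-teacher
    KL accumulated over positions before t, then weights are normalized
    to sum to 1. Earlier tokens therefore receive larger weights.
    """
    weights = []
    accumulated = 0.0
    for s, t in zip(student, teacher):
        weights.append(math.exp(-beta * accumulated))
        accumulated += token_kl(s, t)
    total = sum(weights)
    return [w / total for w in weights]


def weighted_loss(student, teacher, beta=1.0):
    """Weighted sum of per-token KL terms using the hypothetical weights."""
    kls = [token_kl(s, t) for s, t in zip(student, teacher)]
    weights = discrepancy_weights(student, teacher, beta)
    return sum(w * k for w, k in zip(weights, kls))


# Toy rollout: 3 positions over a 2-token vocabulary (made-up numbers).
student = [[0.6, 0.4], [0.7, 0.3], [0.9, 0.1]]
teacher = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

print("weights:", discrepancy_weights(student, teacher))
print("uniform loss:", uniform_loss(student, teacher))
print("weighted loss:", weighted_loss(student, teacher))
```

Setting `beta=0` recovers the uniform average, which makes the sketch easy to sanity-check. In a real training loop the weights would typically be treated as constants when differentiating the loss, but that detail is also an assumption here.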

IMPACT Improves reinforcement learning efficiency and performance by addressing position bias in on-policy distillation.

RANK_REASON Academic paper detailing a new method for reinforcement learning.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Yifei Wang · 2026-06-21 17:20

On the Position Bias of On-Policy Distillation

On-Policy Distillation (OPD) improves the learning efficiency of standard reinforcement learning through dense, token-level supervision from teachers. In the standard KL objective of OPD, token-level losses are uniformly averaged, implying equal weights for all tokens. However, w…
