Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals
Researchers have identified a key bottleneck in Reinforcement Learning from Verifiable Rewards (RLVR) that hinders LLM reasoning optimization. The study pinpoints rigid clipping decisions in standard hard-clipping methods as the cause, which discards valuable signals near the clipping threshold. To address this, they propose Near-boundary Stochastic Rescue (NSR), a simple modification that stochastically retains these slightly out-of-bound tokens, improving training stability and performance across various model sizes and architectures. AI
IMPACT Improves training stability and performance for LLM reasoning tasks, potentially enabling more robust and capable models.