Beyond the Sampled Token: Preserving Candidate Support in RLVR
Researchers have identified a key failure mode in reinforcement learning with verifiable rewards (RLVR) called exploration collapse: probability mass concentrates on the top-ranked response, which limits the number of distinct outcomes the model produces. To address it, they propose Candidate-aware Support Preservation (CaSP), which adjusts gradients for correct responses and penalizes incorrect top-ranked responses. The method improves performance across various benchmarks and model sizes.
AI IMPACT: This research introduces a new approach to improving exploration in RLVR, potentially leading to more diverse and effective AI responses on complex tasks.
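
To make the notion of exploration collapse concrete, here is a minimal sketch of how concentration on the top-ranked candidate could be measured. It is not the CaSP method itself, whose details the summary does not give. The function name `top1_mass_and_entropy` and both example distributions are hypothetical, chosen only for illustration.

```python
import math


def top1_mass_and_entropy(probs):
    """Return (largest probability, Shannon entropy in nats) of a distribution.

    Hypothetical helper for illustration; not part of CaSP.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Normalize defensively so the inputs need not sum to exactly 1.
    p = [x / total for x in probs]
    # Zero-probability candidates contribute nothing to the entropy.
    entropy = -sum(x * math.log(x) for x in p if x > 0)
    return max(p), entropy


# Assumed example distributions over four candidates (made-up numbers).
spread = [0.4, 0.3, 0.2, 0.1]            # several candidates keep real support
collapsed = [0.97, 0.01, 0.01, 0.01]     # nearly all mass on the top candidate

spread_top1, spread_entropy = top1_mass_and_entropy(spread)
collapsed_top1, collapsed_entropy = top1_mass_and_entropy(collapsed)

print(f"spread:    top-1 mass={spread_top1:.2f}, entropy={spread_entropy:.3f}")
print(f"collapsed: top-1 mass={collapsed_top1:.2f}, entropy={collapsed_entropy:.3f}")
```

In the collapsed case the top-1 mass is higher and the entropy lower, which is the pattern the summary describes: once almost all probability sits on one candidate, sampling rarely reaches the alternatives.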