Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
Researchers have developed a new framework called POW3R to improve reinforcement learning with verifiable rewards (RLVR). Static rubric rewards in RLVR may not guide training effectively, so POW3R adapts each criterion's weight to how useful that criterion currently is to the policy. It uses rollout-level contrast to emphasize the criteria that differentiate the policy's outputs, which makes the reward signal more informative without altering the evaluation target. Experiments show that POW3R significantly improves both mean rubric reward and strict completion rate across various tasks and datasets, often reaching its best performance in fewer training steps.
AI IMPACT: Makes reward signals in reinforcement learning more informative, potentially accelerating model training and improving performance on complex tasks.
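The idea of weighting rubric criteria by rollout-level contrast can be illustrated with a small sketch. This is not POW3R's actual formula, which the summary does not give. It assumes that "contrast" is the variance of a criterion's score across rollouts sampled for the same prompt, and that the training weight is a blend of the static rubric weight and the normalized contrast. The names `contrast_weights`, `rollout_rewards`, and the `mix` parameter are hypothetical.

```python
from statistics import pvariance


def contrast_weights(scores, base_weights, mix=0.5):
    """Blend static rubric weights with a per-criterion contrast signal.

    scores: list of rollouts for one prompt; each rollout is a list of
            per-criterion scores in [0, 1].
    base_weights: static rubric weights, one per criterion.
    mix: hypothetical blend factor (0 = static weights only).
    """
    num_criteria = len(base_weights)
    total_base = sum(base_weights)
    base = [w / total_base for w in base_weights]

    # Assumed contrast measure: variance of each criterion across rollouts.
    # A criterion that every rollout passes (or fails) has zero variance
    # and therefore cannot tell the rollouts apart.
    contrast = [pvariance([r[k] for r in scores]) for k in range(num_criteria)]
    total_contrast = sum(contrast)
    if total_contrast == 0:
        # No criterion separates the rollouts; fall back to static weights.
        return base

    dynamic = [c / total_contrast for c in contrast]
    return [(1 - mix) * b + mix * d for b, d in zip(base, dynamic)]


def rollout_rewards(scores, weights):
    """Weighted rubric reward for each rollout."""
    return [sum(w * s for w, s in zip(weights, r)) for r in scores]


# Four rollouts, three criteria. Criterion 0 is always met, criterion 2 is
# never met, and only criterion 1 differs between rollouts.
scores = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0]]
weights = contrast_weights(scores, base_weights=[1, 1, 1])
rewards = rollout_rewards(scores, weights)
```

In this sketch the criterion that separates the rollouts receives most of the training weight, while evaluation would still be scored with the unchanged `base_weights`, in line with the claim that the evaluation target is not altered.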