Negative Advantages Is a Double-Edged Sword: Calibrating Advantages in GRPO for Search Agents
CalibAdv is a new method for improving the training stability and performance of search agents, particularly those trained with Group Relative Policy Optimization (GRPO). It targets two problems: correct intermediate steps are penalized when the final answer is wrong, and training can become unstable, which degrades performance. CalibAdv calibrates how advantages are assigned. It downscales excessive negative advantages according to the correctness of intermediate steps, and it rebalances positive and negative advantages so that rewards and penalties are modeled more stably.
AI IMPACT: Improves training stability and performance for search agents, potentially leading to more reliable AI-powered search functionality.
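The two adjustments described above can be illustrated with a minimal sketch. This is not CalibAdv's actual formulation, which the summary does not give. `grpo_advantages` is the standard group-relative normalization used in GRPO. The `calibrate` function, its `neg_scale` factor, the per-step boolean correctness flags, and the rule that caps total negative mass at total positive mass are all assumptions made for illustration only.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO advantage: normalize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def calibrate(advantages, step_correct, neg_scale=0.5):
    """Hypothetical calibration (illustrative, not the paper's formula).

    advantages:   one scalar advantage per rollout in the group
    step_correct: per rollout, a list of booleans, one per intermediate step
    Returns a list of per-step advantages for each rollout.
    """
    per_step = []
    for adv, flags in zip(advantages, step_correct):
        row = []
        for ok in flags:
            # Assumption: soften the penalty on steps judged correct
            # when the rollout as a whole received a negative advantage.
            if adv < 0 and ok:
                row.append(adv * neg_scale)
            else:
                row.append(adv)
        per_step.append(row)

    # Assumption: rebalance by shrinking negatives whenever their total
    # magnitude exceeds the total positive advantage in the group.
    pos = sum(a for row in per_step for a in row if a > 0)
    neg = -sum(a for row in per_step for a in row if a < 0)
    if neg > pos:
        ratio = pos / neg
        per_step = [[a * ratio if a < 0 else a for a in row] for row in per_step]
    return per_step


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 0.0]  # only the first rollout answered correctly
    flags = [[True, True], [True, False], [False, False], [True, True]]
    adv = grpo_advantages(rewards)
    for row in calibrate(adv, flags):
        print([round(a, 3) for a in row])
```

In this sketch, a correct intermediate step inside a failed rollout receives a smaller penalty than an incorrect one, and the final rescaling keeps penalties from dominating the update. How CalibAdv actually measures step correctness and sets the scaling is defined in the paper, not here.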