DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning
Researchers have introduced DRA-GRPO, a framework designed to improve mathematical reasoning in large language models by addressing the Diversity-Quality Inconsistency inherent in standard GRPO methods. The approach calibrates reward signals using semantic density, measured with Submodular Mutual Information, to de-bias gradient estimation and encourage the model to explore a wider range of valid reasoning strategies. Empirical results on five mathematical benchmarks show that DRA-GRPO outperforms existing methods, achieving 58.2% accuracy with the DeepSeek-R1-Distill-Qwen-1.5B model while using a limited number of training samples and a low training cost.
AI IMPACT: Enhances LLM mathematical reasoning by promoting diverse problem-solving strategies, potentially improving performance on complex tasks.
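To make the reward-calibration idea concrete, here is a minimal sketch in Python, not the authors' implementation. It assumes each sampled completion already has an embedding vector, approximates semantic density with a graph-cut-style sum of clamped cosine similarities to the other completions in the group, and divides the raw reward by one plus that sum. The function names, the clamping, and the exact down-weighting form are illustrative assumptions; the summary above does not specify them.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def diversity_adjusted_rewards(rewards, embeddings):
    """Down-weight rewards of completions that are semantically redundant.

    Assumption: redundancy of completion i is the sum of its similarities to
    every other completion in the group, with negative similarities clamped
    to zero so the denominator stays >= 1.
    """
    adjusted = []
    for i, (reward, emb_i) in enumerate(zip(rewards, embeddings)):
        redundancy = sum(
            max(0.0, cosine(emb_i, emb_j))
            for j, emb_j in enumerate(embeddings)
            if j != i
        )
        adjusted.append(reward / (1.0 + redundancy))
    return adjusted


def group_advantages(rewards, eps=1e-8):
    """GRPO-style group normalization: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of three correct completions: two share one reasoning
# path (identical toy embeddings) and one follows a different path.
raw_rewards = [1.0, 1.0, 1.0]
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

adjusted = diversity_adjusted_rewards(raw_rewards, toy_embeddings)
advantages = group_advantages(adjusted)
```

With identical raw rewards, plain group normalization would give all three completions zero advantage. After the adjustment, the completion with the distinct reasoning path receives a positive advantage and the two redundant ones receive negative advantages, which is the behavior the summary describes as encouraging exploration of diverse strategies.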