Brief · PulseAugur

TOOL · arXiv cs.CL English(EN) · 8h

DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning

Researchers have introduced DRA-GRPO, a framework designed to improve mathematical reasoning in large language models by addressing the Diversity-Quality Inconsistency in standard GRPO: scalar rewards score only the correctness of an answer, so semantically different reasoning paths can receive identical rewards. The approach calibrates reward signals using semantic density, measured with Submodular Mutual Information, to de-bias gradient estimation and encourage the model to explore a wider range of valid reasoning strategies. Empirical results on five mathematical benchmarks show that DRA-GRPO outperforms existing methods, reaching 58.2% accuracy with the DeepSeek-R1-Distill-Qwen-1.5B model while using a limited number of training samples and a low training cost.
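The reward calibration described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `diversity_adjusted_advantages`, the cosine-similarity redundancy score (a simple stand-in for a graph-cut style Submodular Mutual Information term), and the `reward / (1 + redundancy)` down-weighting are all assumptions made for illustration. The final step is the usual GRPO group normalization of rewards into advantages.

```python
import numpy as np

def diversity_adjusted_advantages(rewards, embeddings, eps=1e-8):
    """Hypothetical sketch: down-weight rewards of semantically redundant
    completions, then compute group-relative (GRPO-style) advantages."""
    rewards = np.asarray(rewards, dtype=float)
    emb = np.asarray(embeddings, dtype=float)

    # Cosine similarity between every pair of completions in the group.
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + eps)
    sim = emb @ emb.T
    np.fill_diagonal(sim, 0.0)  # ignore self-similarity

    # Assumed redundancy score: total (non-negative) similarity to the
    # rest of the group, standing in for an SMI-based density estimate.
    redundancy = np.clip(sim, 0.0, None).sum(axis=1)

    # Assumed calibration: completions in dense regions get smaller rewards.
    adjusted = rewards / (1.0 + redundancy)

    # Standard GRPO step: normalize within the sampled group.
    advantages = (adjusted - adjusted.mean()) / (adjusted.std() + eps)
    return adjusted, advantages

# Three correct completions; the first two follow the same reasoning path.
adjusted, advantages = diversity_adjusted_advantages(
    rewards=[1.0, 1.0, 1.0],
    embeddings=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
)
```

In this toy group all three completions are equally correct, yet the third one, which takes a distinct path, ends up with the largest advantage, whereas plain GRPO would assign all three the same value.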

IMPACT Enhances LLM mathematical reasoning by promoting diverse problem-solving strategies, potentially improving performance on complex tasks.

Group Relative Policy Optimization
DeepSeek-R1-Distill-Qwen-1.5B
DRA-GRPO
Diversity-aware Reward Adjustment
Submodular Mutual Information
Inverse Propensity Scoring
Xiwen Chen