Researchers have introduced DRA-GRPO, a framework that aims to improve mathematical reasoning in large language models by addressing the Diversity-Quality Inconsistency in standard Group Relative Policy Optimization (GRPO). The approach calibrates reward signals using semantic density and Submodular Mutual Information to de-bias gradient estimation, which encourages the model to explore a wider range of valid reasoning strategies. Empirical results on five mathematical benchmarks show that DRA-GRPO significantly outperforms existing methods, reaching 58.2% accuracy with the DeepSeek-R1-Distill-Qwen-1.5B model while using a limited number of training samples and a low training cost.
AI IMPACT Enhances LLM mathematical reasoning by promoting diverse problem-solving strategies, potentially improving performance on complex tasks.
RANK_REASON The cluster contains an academic paper detailing a new method for improving LLM reasoning capabilities.
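To make the reward calibration described in the summary more concrete, here is a minimal sketch of one way a diversity-aware adjustment could feed into GRPO-style advantages. It is an illustration under stated assumptions, not the paper's implementation: the function name `diversity_adjusted_advantages`, the use of cosine similarity over completion embeddings, the clipped similarity sum as a redundancy score, and the `1 / (1 + redundancy)` scaling are all choices made here for the example.

```python
import numpy as np


def diversity_adjusted_advantages(embeddings, rewards, eps=1e-8):
    """Down-weight rewards of semantically redundant completions, then
    compute group-normalized (GRPO-style) advantages.

    Hypothetical helper: the similarity measure and scaling rule are
    assumptions for illustration, not taken from the paper.
    """
    emb = np.asarray(embeddings, dtype=float)
    r = np.asarray(rewards, dtype=float)

    # Cosine similarity between every pair of completions in the group.
    unit = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + eps)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # ignore self-similarity

    # Redundancy score: how much each completion overlaps with the rest
    # of the group (negative similarities are clipped to zero).
    redundancy = np.clip(sim, 0.0, None).sum(axis=1)

    # Completions in dense semantic regions get a smaller effective reward.
    adjusted = r / (1.0 + redundancy)

    # Standard group-relative normalization of the adjusted rewards.
    advantages = (adjusted - adjusted.mean()) / (adjusted.std() + eps)
    return adjusted, advantages


if __name__ == "__main__":
    # Three equally rewarded completions; the first two are near-duplicates.
    group_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    group_rewards = [1.0, 1.0, 1.0]
    adjusted, advantages = diversity_adjusted_advantages(group_embeddings, group_rewards)
    print(adjusted)
    print(advantages)
```

In this toy group all three completions earn the same raw reward, so plain group normalization would give them identical advantages. After the adjustment, the two duplicate completions share credit and the semantically distinct one receives the largest advantage, which is the kind of exploration pressure the summary attributes to DRA-GRPO.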
- DeepSeek-R1-Distill-Qwen-1.5B
- Diversity-aware Reward Adjustment
- DRA-GRPO
- Group Relative Policy Optimization
- Inverse Propensity Scoring
- Submodular Mutual Information
- Xiwen Chen
AI-generated summary · Google Gemini · from 1 source.