Researchers have proposed new methods for improving the reasoning of large language models, particularly on long-context and multilingual tasks. One approach, OGLS-SD, uses outcome-guided logit steering to calibrate the teacher model's responses during on-policy self-distillation, which reportedly makes reasoning more stable and effective. Another method, dGRPO, combines on-policy optimization with distillation to strengthen long-context reasoning and is accompanied by a new dataset, LongBlocks. A third, COPSD (Crosslingual On-Policy Self-Distillation), targets low-resource languages by transferring reasoning behavior from high-resource languages through self-distillation, with reported significant improvements in multilingual mathematical reasoning.
Impact: These techniques aim to make LLM reasoning more stable and effective, particularly in challenging long-context and multilingual scenarios, which could broaden the range of settings where such models are applicable.
Ranking rationale: Multiple arXiv papers detail new methods for improving LLM reasoning.
- COPSD
- Crosslingual On-Policy Self-Distillation
- Group Relative Policy Optimization
- Large language models
- dGRPO
- LongBlocks
- OGLS-SD
- GRPO
AI-generated summary · Google Gemini · from 4 sources.