Researchers from Kuaishou's Kwaipilot team have developed SRPO, a reinforcement learning framework designed to improve the training efficiency and performance of large language models. The method uses a two-stage training process to address limitations of standard GRPO, such as sample inefficiency and cross-domain optimization conflicts. SRPO has demonstrated state-of-the-art performance on mathematical and code benchmarks, matching DeepSeek-R1-Zero while requiring only one-tenth of the training steps.
Ranking rationale: Open-source release of a novel training method and model from a non-frontier lab, achieving competitive benchmark results.
- AIME24
- DeepSeek-R1
- DeepSeek-R1-Zero
- Kuaishou
- Kwai AI
- Kwaipilot
- LiveCodeBench
- LLMs
- OpenAI
- Qwen2.5-32B
- SRPO-Qwen-32B
- GRPO
AI-generated summary · Google Gemini · From 1 source.