PulseAugur

Kwai AI's SRPO achieves DeepSeek-R1-Zero performance with 10x fewer training steps

Researchers from Kuaishou's Kwaipilot team have developed SRPO, a new reinforcement learning framework designed to make RL post-training of large language models more efficient and effective. The method targets limitations of standard GRPO (Group Relative Policy Optimization), such as sample inefficiency and conflicts when optimizing across domains, by using a two-stage training process. The team reports state-of-the-art results on mathematical and code benchmarks, matching DeepSeek-R1-Zero while requiring only one-tenth of the training steps.
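The summary does not spell out the mechanics. As a rough illustration of the GRPO sample-inefficiency problem it mentions, the sketch below computes GRPO-style group-relative advantages: each rollout's reward is normalized by the mean and standard deviation of its own group, so a prompt whose rollouts all receive the same reward yields zero advantage and no learning signal. The filter in `keep_for_next_epoch` is a hypothetical stand-in for the "history resampling" named in the Synced Review snippet, assuming it drops prompts the model already solves on every rollout. It is not Kwaipilot's code, and all function names and the binary 0/1 reward are illustrative assumptions.

```python
import statistics
from typing import Dict, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: normalize each rollout's reward against its own group.

    Uses the population standard deviation; real implementations differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def keep_for_next_epoch(rewards: List[float]) -> bool:
    """Hypothetical history-resampling rule (an assumption, not SRPO's published code):
    drop a prompt only if every rollout last epoch was already correct (reward 1.0)."""
    return not all(r == 1.0 for r in rewards)


def resample(history: Dict[str, List[float]]) -> List[str]:
    """Return the prompts that survive the hypothetical filter, in insertion order."""
    return [prompt for prompt, rewards in history.items() if keep_for_next_epoch(rewards)]


if __name__ == "__main__":
    # Per-prompt rollout rewards from one (made-up) epoch, 1.0 = correct answer.
    history = {
        "easy": [1.0, 1.0, 1.0, 1.0],
        "mixed": [1.0, 0.0, 1.0, 0.0],
        "hard": [0.0, 0.0, 0.0, 0.0],
    }
    for prompt, rewards in history.items():
        print(prompt, [round(a, 3) for a in group_relative_advantages(rewards)])
    print(resample(history))
```

Run as a script, the "easy" and "hard" groups get all-zero advantages while the "mixed" group gets roughly +1 and -1, and only "easy" is filtered out. Keeping the all-wrong "hard" prompt is part of the assumption: it contributes no gradient now but could once the model improves.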

Ranking rationale: open-source release of a new training method and model from a non-frontier lab, with competitive benchmark results.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

  1. Synced Review · TIER_1 · English (EN) · Synced

    Can GRPO be 10x Efficient? Kwai AI’s SRPO Suggests Yes with SRPO

    Kwai AI's SRPO framework slashes LLM RL post-training steps by 90% while matching DeepSeek-R1 performance in math and code. This two-stage RL approach with history resampling overcomes GRPO limitations. Original post: https://syncedreview.com/2025/04/23/can-grpo-be-10x-…