Kwai AI's SRPO achieves DeepSeek-R1-Zero performance with 10x fewer training steps

By PulseAugur Editorial · [1 source] · 2025-04-24 02:30

Researchers from Kuaishou's Kwaipilot team have developed SRPO, a reinforcement learning framework designed to improve the training efficiency and performance of large language models. The method addresses two limitations of standard GRPO, sample inefficiency and cross-domain optimization conflicts, by using a two-stage training process. SRPO reports state-of-the-art results on mathematical and code benchmarks, matching DeepSeek-R1-Zero while requiring only one-tenth of the training steps.

Rank reason: Open-source release of a novel training method and model from a non-frontier lab, achieving competitive benchmark results.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

Synced Review TIER_1 English(EN) · Synced · 2025-04-24 02:30

Can GRPO be 10x Efficient? Kwai AI’s SRPO Suggests Yes with SRPO

Kwai AI's SRPO framework slashes LLM RL post-training steps by 90% while matching DeepSeek-R1 performance in math and code. This two-stage RL approach with history resampling overcomes GRPO limitations.
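As a rough illustration of why resampling from rollout history can help with GRPO's sample inefficiency, the Python sketch below computes group-normalized advantages in the GRPO style and shows that a prompt whose rollouts all earn the same reward yields zero advantages, and therefore no learning signal. The coverage above does not spell out SRPO's actual resampling rule, so the `keep_for_next_epoch` filter, the prompt names, and the reward values are all hypothetical assumptions for this sketch, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group normalization: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_for_next_epoch(rewards):
    """Assumed filter (hypothetical): drop prompts whose rollouts were all correct."""
    return not all(r == 1.0 for r in rewards)


# Hypothetical rollout history: prompt id -> binary rewards from one sampled group.
history = {
    "easy_prompt": [1.0, 1.0, 1.0, 1.0],   # every rollout correct
    "mixed_prompt": [1.0, 0.0, 0.0, 1.0],  # informative: outcomes differ
    "hard_prompt": [0.0, 0.0, 0.0, 0.0],   # every rollout wrong
}

easy_adv = group_advantages(history["easy_prompt"])
mixed_adv = group_advantages(history["mixed_prompt"])

# Prompts that survive the assumed filter and would be sampled again.
retained = [prompt for prompt, rewards in history.items() if keep_for_next_epoch(rewards)]

if __name__ == "__main__":
    print("easy advantages:", easy_adv)
    print("mixed advantages:", mixed_adv)
    print("retained prompts:", retained)
```

In this sketch the all-correct group contributes nothing to the policy gradient, so spending rollouts on it is wasted compute. Whether all-incorrect prompts should be kept, as done here, is likewise an assumption of the example.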
