Cross-Epoch Adaptive Rollout Optimization for RL Post-Training
Researchers have introduced CERO, a method for optimizing reinforcement learning post-training of large language models. CERO adaptively allocates a fixed budget of rollouts across prompts, whereas previous methods used a static budget. It uses Bayesian estimates of each prompt's success probability to determine the value of additional rollouts, which improves sample efficiency. In experiments on mathematical reasoning tasks with various open-weight LLMs, CERO outperformed existing methods.
AI IMPACT: Improves sample efficiency in LLM training, potentially leading to faster development of more capable models.
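
To make the allocation idea concrete, here is a minimal Python sketch of budgeted rollout allocation driven by Bayesian success estimates. It is not CERO's actual algorithm, which the summary does not specify. Everything in it is an assumption made for illustration: the Beta(1, 1) prior, the use of the expected Bernoulli variance E[p(1-p)] as a stand-in for the "value of additional rollouts", the 1/(k+1) diminishing-returns rule, and the function names themselves.

```python
import heapq


def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    """Beta posterior parameters for a prompt's success probability.

    The uniform Beta(1, 1) prior is an assumption for this sketch.
    """
    return prior_a + successes, prior_b + failures


def expected_bernoulli_variance(a, b):
    """E[p * (1 - p)] when p ~ Beta(a, b).

    Used here as a hypothetical proxy for how informative another rollout
    is: it is largest for prompts that succeed about half the time and
    small for prompts that almost always pass or almost always fail.
    """
    return a * b / ((a + b) * (a + b + 1.0))


def allocate_rollouts(history, budget, min_per_prompt=1):
    """Greedily split a fixed rollout budget across prompts.

    history maps prompt id -> (successes, failures) from earlier epochs.
    The marginal value of a prompt's next rollout is assumed to be
    score / (k + 1), where k is the number already allocated.
    """
    if budget < min_per_prompt * len(history):
        raise ValueError("budget too small for the per-prompt minimum")

    scores = {
        pid: expected_bernoulli_variance(*beta_posterior(s, f))
        for pid, (s, f) in history.items()
    }
    alloc = {pid: min_per_prompt for pid in history}
    remaining = budget - min_per_prompt * len(history)

    # Max-heap (via negated values) keyed on each prompt's marginal value.
    heap = [(-scores[pid] / (alloc[pid] + 1), pid) for pid in history]
    heapq.heapify(heap)

    for _ in range(remaining):
        _, pid = heapq.heappop(heap)
        alloc[pid] += 1
        heapq.heappush(heap, (-scores[pid] / (alloc[pid] + 1), pid))
    return alloc


if __name__ == "__main__":
    # Hypothetical pass/fail counts from previous epochs.
    history = {"easy": (9, 1), "medium": (5, 5), "hard": (0, 10)}
    print(allocate_rollouts(history, budget=12))
```

Under these assumptions, a prompt with a mixed pass/fail record receives more of the budget than one that nearly always passes or nearly always fails, which is the intuition behind replacing a static per-prompt budget with an adaptive one.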