
TOOL · arXiv cs.CL English(EN) · 8h

Why Tree-Style Branching Matters for Thought Advantage Estimation in GRPO

A new research paper examines why tree-style branching helps in Group Relative Policy Optimization (GRPO), a method used to train Chain-of-Thought reasoning in AI models. Using the multivariate delta method, the study shows that increasing the number of sampled thoughts yields only limited gains because the estimation variance hits a floor, whereas increasing the number of continuations per thought can significantly reduce that variance. This suggests that continuation-level branching is a crucial mechanism for accurate thought advantage estimation in GRPO rather than merely a heuristic. Experiments across various domains and model architectures support these findings, showing improvements in training stability, efficiency, and overall performance.
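The variance-floor argument can be illustrated with a toy calculation. The sketch below is not the paper's estimator or its delta-method analysis. It assumes a simplified setting: every thought has the same success probability `p`, continuation rewards are independent Bernoulli draws, and the advantage of one thought is its mean continuation reward minus a leave-one-out mean over the other thoughts, with no standard-deviation normalization. The function name `advantage_variance` and its parameters are hypothetical.

```python
def advantage_variance(num_thoughts: int, num_continuations: int, p: float = 0.5) -> float:
    """Variance of a toy leave-one-out advantage estimate for a single thought.

    Assumptions (illustrative only): independent Bernoulli(p) rewards, the same
    p for every thought, and a baseline equal to the mean over the other thoughts.
    """
    if num_thoughts < 2 or num_continuations < 1:
        raise ValueError("need at least 2 thoughts and 1 continuation per thought")
    reward_var = p * (1.0 - p)  # variance of one Bernoulli reward
    # Noise in the thought's own mean reward: unaffected by the number of thoughts.
    own_term = reward_var / num_continuations
    # Noise in the baseline: shrinks as more thoughts are averaged.
    baseline_term = reward_var / (num_continuations * (num_thoughts - 1))
    return own_term + baseline_term


if __name__ == "__main__":
    # Vary thoughts with one continuation each, then vary continuations with few thoughts.
    for k in (2, 8, 64, 1024):
        print(f"thoughts={k:5d} continuations=1 variance={advantage_variance(k, 1):.4f}")
    for m in (1, 2, 8, 32):
        print(f"thoughts=    4 continuations={m:2d} variance={advantage_variance(4, m):.4f}")
```

In this toy setting, adding thoughts only shrinks the baseline term, so the variance cannot drop below `p * (1 - p) / num_continuations`, while adding continuations scales both terms down together.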

Hugging Face
Group Relative Policy Optimization
GRPO
Hongcheng Wang
multivariate delta method