Why Tree-Style Branching Matters for Thought Advantage Estimation in GRPO
A new research paper examines how effective tree-style branching is in Group Relative Policy Optimization (GRPO), a method for training Chain-of-Thought reasoning in AI models. Using the multivariate delta method, the study shows that increasing the number of sampled thoughts yields only limited gains because of a variance floor, whereas increasing the number of continuations per thought can significantly reduce the variance of the advantage estimate. This suggests that continuation-level branching is a crucial mechanism for accurate advantage estimation in GRPO rather than merely a heuristic. Experiments across various domains and model architectures support these findings, showing improvements in training stability, efficiency, and overall performance.
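The variance-floor argument can be illustrated with a small Monte Carlo sketch. This is a toy model built on assumptions that are not taken from the paper: rewards are Bernoulli, each thought's value is estimated as the mean reward over its M continuations, the advantage baseline is a leave-one-out mean over the other K - 1 thoughts, and the other thoughts' success probabilities are drawn uniformly from [0, 1]. The function name `advantage_error_variance` and its parameters are hypothetical. It does not reproduce the paper's delta-method derivation or its estimator.

```python
import numpy as np


def advantage_error_variance(K, M, trials=20000, seed=0):
    """Monte Carlo estimate of the squared error of a toy thought-advantage estimate.

    Toy assumptions (not from the paper):
      - thought 0 has true success probability 0.5;
      - the other K - 1 thoughts have success probabilities ~ Uniform(0, 1),
        so the population mean value is also 0.5 and the true advantage is 0;
      - each thought's value is estimated from M Bernoulli continuations;
      - the baseline is the leave-one-out mean of the other thoughts' estimates.
    """
    rng = np.random.default_rng(seed)
    p0 = 0.5

    # True success probabilities of the K - 1 competing thoughts, per trial.
    p_others = rng.uniform(0.0, 1.0, size=(trials, K - 1))

    # Value estimates: fraction of successful continuations out of M.
    v0_hat = rng.binomial(M, p0, size=trials) / M
    v_others_hat = rng.binomial(M, p_others) / M

    # Leave-one-out baseline and the resulting advantage estimate.
    baseline = v_others_hat.mean(axis=1)
    adv_hat = v0_hat - baseline

    # The true advantage is 0 in this toy, so the mean square is the error variance.
    return float(np.mean(adv_hat ** 2))


if __name__ == "__main__":
    for K in (4, 16, 64):
        for M in (1, 4, 16):
            print(f"K={K:3d} M={M:3d} error variance = {advantage_error_variance(K, M):.4f}")
```

In this toy, the error variance works out to 0.25/M + (1/12 + 1/(6M))/(K - 1). The second term vanishes as K grows, but the first term depends only on M, so adding thoughts cannot push the error below 0.25/M, while adding continuations per thought lowers that floor directly.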