Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data
A new paper shows that transformers trained with outcome-based reinforcement learning can learn to reason by generating intermediate steps, in the manner of Chain-of-Thought. The research proves that even with sparse rewards that depend only on whether the final answer is correct, policy gradients can guide transformers to learn structured, iterative algorithms for tasks such as graph traversal. Crucially, the study finds that this reasoning capability emerges only under a suitable training data distribution: the data must contain enough simpler examples for the model to generalize effectively.
AI IMPACT: Demonstrates a theoretical pathway for emergent reasoning in LLMs, potentially guiding future training methodologies for improved performance on complex tasks.
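
To make the idea of an outcome-only reward concrete, here is a minimal sketch of a policy-gradient (REINFORCE) update on a toy graph-traversal task. It is not the paper's construction: the policy is a small table of logits rather than a transformer, and the graph, the function names, and the learning rate are all hypothetical choices made for illustration. The only training signal is whether the sampled path ends at the target node.

```python
import math
import random

# Hypothetical toy DAG: node -> list of successors.
# Node 3 is the target; node 4 is a dead end.
GRAPH = {0: [1, 4], 1: [2, 4], 2: [3, 4], 3: [], 4: []}
START, TARGET = 0, 3


def softmax(logits):
    """Numerically stable softmax over a non-empty list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rollout(logits, rng):
    """Sample intermediate steps from START until a node with no successors."""
    node, steps = START, []
    while GRAPH[node]:
        probs = softmax(logits[node])
        action = rng.choices(range(len(probs)), weights=probs)[0]
        steps.append((node, action))
        node = GRAPH[node][action]
    return steps, node


def outcome_reward(final_node):
    """Sparse reward: 1 only if the final answer (last node) is correct."""
    return 1.0 if final_node == TARGET else 0.0


def reinforce_update(logits, steps, reward, lr=0.5):
    """REINFORCE without a baseline: every step shares the final reward."""
    for node, action in steps:
        probs = softmax(logits[node])
        for a in range(len(probs)):
            # Gradient of log softmax w.r.t. logit a for the sampled action.
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[node][a] += lr * reward * grad


def train(episodes=2000, seed=0):
    """Train the tabular policy using only end-of-path rewards."""
    rng = random.Random(seed)
    logits = {n: [0.0] * len(succ) for n, succ in GRAPH.items()}
    for _ in range(episodes):
        steps, final = rollout(logits, rng)
        reinforce_update(logits, steps, outcome_reward(final))
    return logits
```

In this toy, a uniformly random policy reaches the target with probability (1/2)^3 = 1/8, and that probability halves with every extra node added to the chain. Under a sparse reward, long instances therefore almost never produce a learning signal from a fresh policy, which gives one informal intuition (not the paper's argument) for why easier examples in the training distribution matter.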