STRIDE: Strategic Trajectory Reasoning via Discriminative Estimation for Verifiable Reinforcement Learning
Researchers have introduced STRIDE, a framework for Reinforcement Learning with Verifiable Rewards (RLVR) designed to improve the reasoning capabilities of large language models. Unlike previous methods that rely only on final-answer correctness as the reward, STRIDE derives fine-grained supervision from verifiable outcomes. It contrasts successful and failed trajectories to estimate how strongly each n-gram strategic pattern is associated with success (its outcome-discriminative preference), which allows more precise credit assignment during RL optimization. The reported experiments show that STRIDE consistently improves reasoning performance across a range of models and tasks, including Vision-Language Models and agent-based systems.
AI IMPACT: This framework could lead to more reliable and verifiable reasoning in LLMs, improving their performance on complex tasks.
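The summary does not say how STRIDE actually computes the preference of an n-gram pattern, so the following is only a minimal sketch of the general idea of contrasting successful and failed trajectories. It assumes each trajectory has already been reduced to a sequence of strategy labels, scores each n-gram with a smoothed log-odds ratio, and spreads those scores back over the steps as credit. The labels, the function names `pattern_preferences` and `step_credit`, the smoothing constant `alpha`, and the log-odds estimator are all illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter


def ngrams(steps, n):
    """Return the contiguous n-grams (as tuples) of a step sequence."""
    return [tuple(steps[i:i + n]) for i in range(len(steps) - n + 1)]


def pattern_preferences(trajectories, n=2, alpha=1.0):
    """Score each n-gram by the smoothed log-odds of success vs. failure.

    trajectories: list of (steps, succeeded) pairs. `steps` is a list of
    hypothetical strategy labels; `succeeded` is the verifier's outcome.
    This estimator is an assumption made for illustration only.
    """
    success_counts, failure_counts = Counter(), Counter()
    n_success = n_failure = 0
    for steps, succeeded in trajectories:
        present = set(ngrams(steps, n))  # count a pattern once per trajectory
        if succeeded:
            success_counts.update(present)
            n_success += 1
        else:
            failure_counts.update(present)
            n_failure += 1

    scores = {}
    for pattern in set(success_counts) | set(failure_counts):
        # Additive smoothing keeps the ratio finite for unseen patterns.
        p_success = (success_counts[pattern] + alpha) / (n_success + 2 * alpha)
        p_failure = (failure_counts[pattern] + alpha) / (n_failure + 2 * alpha)
        scores[pattern] = math.log(p_success / p_failure)
    return scores


def step_credit(steps, scores, n=2):
    """Give each step the mean score of the n-grams that cover it."""
    total = [0.0] * len(steps)
    cover = [0] * len(steps)
    for i, pattern in enumerate(ngrams(steps, n)):
        for j in range(i, i + n):
            total[j] += scores.get(pattern, 0.0)
            cover[j] += 1
    return [t / c if c else 0.0 for t, c in zip(total, cover)]


# Toy rollouts with made-up strategy labels and verifier outcomes.
rollouts = [
    (["plan", "decompose", "verify"], True),
    (["plan", "decompose", "compute"], True),
    (["guess", "compute"], False),
    (["plan", "guess", "compute"], False),
]

scores = pattern_preferences(rollouts, n=2)
credit = step_credit(["plan", "guess", "compute"], scores, n=2)
```

In this toy data, patterns seen only in successful rollouts, such as ("plan", "decompose"), receive positive scores, while patterns seen only in failed ones, such as ("guess", "compute"), receive negative scores. In an RL setting, per-step values like `credit` could be used to reweight a trajectory-level advantage, though how STRIDE integrates its estimates into optimization is not described in this summary.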