Brief · PulseAugur

RESEARCH · arXiv cs.LG English(EN) · 1d · [2 sources]

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

Researchers have introduced Trajectory-Augmented Policy Optimization (TAPO), a self-distillation method for large language models. Rather than minimizing KL divergence as traditional approaches do, TAPO constructs explicit training trajectories: it retains the erroneous reasoning up to the point of failure, then appends a natural-language diagnosis and the corrected reasoning. The method aims to provide finer-grained error correction, and in experiments it improved consistently over GRPO on AIME 2024, AIME 2025, and HMMT 2025.
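The trajectory construction described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `ReflectiveTrajectory` and `build_trajectory`, the `Diagnosis:` prefix, and the choice to keep the failing step itself in the retained prefix are all hypothetical. How the error is located, how the diagnosis is written, and how the trajectory is used in training are not covered here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ReflectiveTrajectory:
    """One training trajectory: flawed prefix, diagnosis, corrected reasoning."""

    kept_steps: List[str]       # original attempt, retained up to the point of failure
    diagnosis: str              # natural-language explanation of what went wrong
    corrected_steps: List[str]  # reasoning that replaces the failed continuation

    def to_text(self) -> str:
        # Join the three parts into a single sequence, one step per line.
        # The "Diagnosis:" prefix is an illustrative choice, not from the paper.
        parts = self.kept_steps + [f"Diagnosis: {self.diagnosis}"] + self.corrected_steps
        return "\n".join(parts)


def build_trajectory(
    attempt_steps: Sequence[str],
    first_error_index: int,
    diagnosis: str,
    corrected_steps: Sequence[str],
) -> ReflectiveTrajectory:
    """Assemble a trajectory from a failed attempt and its correction.

    Assumption: the failing step is kept, so the diagnosis has something
    concrete to refer to; everything after it is discarded.
    """
    if not 0 <= first_error_index < len(attempt_steps):
        raise ValueError("first_error_index must point at a step in attempt_steps")
    kept = list(attempt_steps[: first_error_index + 1])
    return ReflectiveTrajectory(kept, diagnosis, list(corrected_steps))


if __name__ == "__main__":
    trajectory = build_trajectory(
        attempt_steps=["Let x = 3.", "Then 2x = 5.", "So 2x + 1 = 6."],
        first_error_index=1,
        diagnosis="2 * 3 is 6, not 5.",
        corrected_steps=["Then 2x = 6.", "So 2x + 1 = 7."],
    )
    print(trajectory.to_text())
```

Keeping the flawed prefix in the sequence, rather than replacing the whole attempt with a clean solution, is what places the correction at the specific step where the reasoning went wrong.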

IMPACT By targeting the specific point where reasoning fails, this method could make LLM training more efficient and effective.

AIME 2025
GRPO
AIME 2024
TAPO
Trajectory-Augmented Policy Optimization
HMMT 2025