Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

New TAPO method strengthens LLM self-distillation through explicit error correction

By the PulseAugur editorial team · [2 sources] · 2026-06-17 09:24

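Researchers have introduced a new method for self-distillation of large language models: Trajectory-Augmented Policy Optimization (TAPO). Conventional self-distillation minimizes KL divergence toward a target distribution. TAPO instead builds explicit training trajectories: it keeps the model's erroneous reasoning up to the point of failure, then appends a natural-language diagnosis of the error and the corrected reasoning. The goal is finer-grained error correction, and in experiments on AIME 2024, AIME 2025, and HMMT 2025 the method showed consistent improvements over GRPO.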
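The trajectory construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the rollout is already split into reasoning steps and that the index of the first faulty step, the diagnosis text, and the corrected steps are supplied by the caller (the summary does not say how TAPO obtains them). The names `Segment`, `build_reflective_trajectory`, and `render` and the segment labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    kind: str  # hypothetical labels: "prefix", "diagnosis", or "correction"
    text: str


def build_reflective_trajectory(
    steps: List[str], error_index: int, diagnosis: str, corrected_steps: List[str]
) -> List[Segment]:
    """Keep the rollout up to and including the first faulty step, then append
    a natural-language diagnosis and the corrected continuation."""
    if not 0 <= error_index < len(steps):
        raise ValueError("error_index must point at a step in the rollout")
    # Erroneous reasoning is retained up to the failure point; later steps are dropped.
    segments = [Segment("prefix", s) for s in steps[: error_index + 1]]
    segments.append(Segment("diagnosis", diagnosis))
    segments.extend(Segment("correction", s) for s in corrected_steps)
    return segments


def render(segments: List[Segment]) -> str:
    """Join segments into a single training text, one segment per line."""
    return "\n".join(seg.text for seg in segments)


# Usage: the rollout goes wrong at step index 1 (adds 3 instead of subtracting it).
rollout = ["x + 3 = 10", "x = 10 + 3", "x = 13"]
traj = build_reflective_trajectory(
    rollout,
    error_index=1,
    diagnosis="Wait: subtracting 3 from both sides gives x = 10 - 3, not 10 + 3.",
    corrected_steps=["x = 10 - 3", "x = 7"],
)
print(render(traj))
```

The downstream step "x = 13" is discarded because it follows the failure point, so the resulting text reads as one continuous trajectory that makes a mistake, names it, and recovers. Keeping the segment labels would let a training loop weight or mask the prefix differently from the correction, though whether TAPO does so is not stated in the summary.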

Impact: By providing targeted error correction, the method could make LLM training more effective and efficient.

Ranking rationale: This cluster contains an academic paper detailing a new method for LLM self-distillation.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.LG · Zhilin Huang, Hang Gao, Ziqiang Dong, Yuan Chen, Yifeng Luo, Chujun Qin, Jingyi Wang, Yang Yang, Guanjun Jiang · 2026-06-18 04:00

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

arXiv:2606.18844v1 Announce Type: new Abstract: Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target distributio…
arXiv cs.LG · Guanjun Jiang · 2026-06-17 09:24

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target distribution. However, because this supervision is generate…
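For contrast, the conventional objective the abstract describes, implicit logit-level alignment that minimizes KL divergence toward a target distribution, can be sketched per token as below. This is a minimal illustration, not the paper's baseline: it assumes plain Python lists of logits, picks forward KL(target || student) as an assumption (methods differ on the direction), averages over tokens, and leaves open what makes the target "privileged", since the excerpt is cut off. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def token_kl(target_logits: Sequence[float], student_logits: Sequence[float]) -> float:
    """KL(target || student) at one token position (direction is an assumption)."""
    log_p = log_softmax(target_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def sequence_kl(target_seq, student_seq) -> float:
    """Mean per-token KL over a rollout (one logit row per generated token)."""
    if len(target_seq) != len(student_seq) or not target_seq:
        raise ValueError("need equal-length, non-empty logit sequences")
    total = sum(token_kl(t, s) for t, s in zip(target_seq, student_seq))
    return total / len(target_seq)


# Toy 3-token vocabulary, 2-token rollout.
target = [[2.0, 0.5, -1.0], [0.0, 3.0, 0.0]]
student = [[1.0, 1.0, -1.0], [0.0, 1.0, 0.0]]
print(sequence_kl(target, target))   # 0.0: identical distributions
print(sequence_kl(target, student))  # positive: student disagrees with target
```

Because the signal here is a per-token distance between distributions, it never tells the model in words which step was wrong; that implicit character is what the TAPO trajectories are meant to replace with an explicit diagnosis.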
