PulseAugur
Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

New TAPO method strengthens LLM self-distillation with explicit error correction

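Researchers have introduced Trajectory-Augmented Policy Optimization (TAPO), a new method for self-distillation of large language models. Unlike conventional approaches that minimize KL divergence, TAPO builds explicit training trajectories: it keeps the model's flawed reasoning up to the point of failure, then appends a natural-language diagnosis and the corrected reasoning. The method aims to provide finer-grained error correction, and experiments on AIME 2024, AIME 2025, and HMMT 2025 show consistent improvements over GRPO.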
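The trajectory construction described above can be pictured with a small sketch. This is an illustration only, not the paper's implementation: the names `MicroReflectiveTrajectory` and `build_trajectory`, the `[diagnosis]` marker, and the step-list representation are all hypothetical, and how the failure point and the diagnosis are actually obtained is not covered by the summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MicroReflectiveTrajectory:
    """Hypothetical container: failed prefix, diagnosis, corrected reasoning."""
    failed_prefix: List[str]     # original steps up to and including the failing one
    diagnosis: str               # natural-language explanation of the error
    corrected_steps: List[str]   # reasoning that replaces the failed continuation

    def to_text(self) -> str:
        # Join the three parts into one training sequence, one step per line.
        # The "[diagnosis]" marker is an assumed format, not taken from the paper.
        parts = list(self.failed_prefix)
        parts.append(f"[diagnosis] {self.diagnosis}")
        parts.extend(self.corrected_steps)
        return "\n".join(parts)


def build_trajectory(
    steps: List[str],
    failure_index: int,
    diagnosis: str,
    corrected_steps: List[str],
) -> MicroReflectiveTrajectory:
    """Keep the rollout up to the failing step, then attach diagnosis and fix."""
    if not 0 <= failure_index < len(steps):
        raise ValueError("failure_index must point at a step in the rollout")
    # Steps after the failure point are discarded; the error itself is kept.
    return MicroReflectiveTrajectory(
        failed_prefix=steps[: failure_index + 1],
        diagnosis=diagnosis,
        corrected_steps=list(corrected_steps),
    )


# Toy rollout with an arithmetic slip at step index 1.
example = build_trajectory(
    steps=["Let x = 3.", "Then 2x = 5.", "So 2x + 1 = 6."],
    failure_index=1,
    diagnosis="2 * 3 is 6, not 5.",
    corrected_steps=["Then 2x = 6.", "So 2x + 1 = 7."],
)
print(example.to_text())
```

In this toy case the erroneous step stays in the sequence while the step that followed it is dropped, so the resulting text shows the mistake, its diagnosis, and the repair in order.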

Impact: By providing targeted error correction, the method could make LLM training more effective and efficient.

Ranking rationale: This cluster contains an academic paper detailing a new method for LLM self-distillation.

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 3 sources. How we write summaries →

Sources [3]

  1. arXiv cs.LG TIER_1 English(EN) · Zhilin Huang, Hang Gao, Ziqiang Dong, Yuan Chen, Yifeng Luo, Chujun Qin, Jingyi Wang, Yang Yang, Guanjun Jiang ·

    Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

    arXiv:2606.18844v1 Announce Type: new Abstract: Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target distributio…

  2. arXiv cs.LG TIER_1 English(EN) · Guanjun Jiang ·

    Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

    Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target distribution. However, because this supervision is generate…

  3. Hugging Face Daily Papers TIER_1 English(EN) ·

    Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

    Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target distribution. However, because this supervision is generate…