PulseAugur
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

New self-distillation methods improve large language model performance on reasoning tasks

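Researchers have developed new self-distillation techniques for large language models that improve performance without relying on external feedback. AVSD (Adaptive View Self-Distillation) balances a consensus signal across multiple views of privileged information and uses view-specific residuals to strengthen learning. Self-Policy Distillation (SPD) extracts a capability subspace from gradients to improve performance and generalization, particularly in code generation and mathematical reasoning. CEPO (Contrastive Evidence Policy Optimization) sharpens credit assignment on key tokens by contrasting correct and incorrect answers, which improves accuracy on multimodal math reasoning benchmarks.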
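The idea of sharpening per-token credit by contrasting correct and incorrect answers can be illustrated with a small sketch. This is not the CEPO objective, which the summary does not spell out. It assumes, purely for illustration, that each token of a correct solution has two log-probabilities available: one from the model conditioned on the correct answer and one conditioned on an incorrect answer. The function names, the softmax weighting, and the `temperature` parameter are all hypothetical.

```python
import math


def uniform_token_rewards(sequence_reward, num_tokens):
    """Baseline: every token gets the same sequence-level reward."""
    return [sequence_reward] * num_tokens


def contrastive_token_weights(logp_given_correct, logp_given_wrong, temperature=1.0):
    """Hypothetical weighting: softmax over per-token log-prob gaps.

    A token that is much more likely when the model is conditioned on the
    correct answer than on an incorrect one receives a larger weight.
    """
    if len(logp_given_correct) != len(logp_given_wrong):
        raise ValueError("log-prob lists must have the same length")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    gaps = [(c - w) / temperature
            for c, w in zip(logp_given_correct, logp_given_wrong)]
    peak = max(gaps)  # subtract the max for numerical stability
    exps = [math.exp(g - peak) for g in gaps]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_token_rewards(sequence_reward, weights):
    """Redistribute the sequence reward so its total over tokens is unchanged."""
    n = len(weights)
    return [sequence_reward * n * w for w in weights]


# Illustrative numbers only: the middle token is the "decisive" one.
logp_correct = [-0.1, -0.2, -0.1]
logp_wrong = [-0.1, -2.2, -0.1]

weights = contrastive_token_weights(logp_correct, logp_wrong)
print(uniform_token_rewards(1.0, 3))
print(weighted_token_rewards(1.0, weights))
```

In this toy setup the filler tokens, whose log-probabilities barely change between the two conditions, keep only a small share of the reward, while the total reward over the sequence stays the same as in the uniform baseline.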

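Impact: These self-distillation techniques improve the performance and generalization of large language models on complex reasoning tasks without external supervision.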

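Ranking rationale: This cluster contains multiple research papers detailing novel approaches to self-distillation for large language models.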


AI-generated summary · Google Gemini · from 5 sources.


Sources [5]

  1. arXiv cs.AI · TIER_1 · Duy Nguyen, Hanqi Xiao, Archiki Prasad, Zaid Khan, Anirban Das, Austin Zhang, Sambit Sahu, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal

    AVSD: Adaptive View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals

    arXiv:2605.20643v1 Announce Type: cross Abstract: Self-distillation enables language models to learn on-policy from their own trajectories by using the same model as both student and teacher, with the teacher being conditioned on privileged information unavailable to the student.…

  2. arXiv cs.CL · TIER_1 · Guangya Hao, Yitong Shang, Yunbo Long, Zhuokai Zhao, Hanxue Liang

    Self-Policy Distillation via Capability-Selective Subspace Projection

    arXiv:2605.22675v1 Announce Type: new Abstract: Self-distillation bootstraps large language models (LLMs) by training on their own generations. However, existing methods either rely on external signals to curate self-generated outputs (e.g., correctness filtering, execution feedb…

  3. arXiv cs.CL · TIER_1 · Hanxue Liang

    Self-Policy Distillation via Capability-Selective Subspace Projection

    Self-distillation bootstraps large language models (LLMs) by training on their own generations. However, existing methods either rely on external signals to curate self-generated outputs (e.g., correctness filtering, execution feedback, and reward search), which are costly and un…

  4. arXiv cs.AI · TIER_1 · Mohit Bansal

    AVSD: Adaptive View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals

    Self-distillation enables language models to learn on-policy from their own trajectories by using the same model as both student and teacher, with the teacher being conditioned on privileged information unavailable to the student. Such information can come in different types or v…

  5. arXiv cs.CL · TIER_1 · Salman Khan

    CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

    When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct…