Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates

LLM Training Research Explores Distillation, Feedback, and Optimizers

By the PulseAugur editorial team · [5 sources] · 2026-05-18 03:09

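New research explores ways to make large language model (LLM) training more efficient and effective. One study challenges the assumption that knowledge distillation requires a strong teacher model, finding that even a smaller teacher can benefit a larger student when the losses are mixed appropriately. Another paper introduces Introspective X Training (IXT), which uses feedback-conditioned data to improve scaling and performance across all stages of LLM training, yielding significant gains in compute efficiency. In addition, research on optimizers suggests that stabilizing stochastic gradient descent (SGD) with a clipping mechanism can help it reach performance comparable to adaptive optimizers such as Adam in LLM pre-training.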

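Impact: These papers explore new techniques for more efficient and effective LLM training, which could lead to better performance and lower compute costs.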

Ranking rationale: This cluster contains several academic papers detailing novel research and methods for LLM training.


AI-generated summary · Google Gemini · based on 5 sources.

Sources [5]

arXiv cs.CL TIER_1 · Taiming Lu, Zhuang Liu · 2026-05-25 04:00

Strong Teacher Not Needed? On Distillation in LLM Pretraining

arXiv:2605.23857v1 Announce Type: cross Abstract: Knowledge distillation generally assumes a strong-to-weak relationship where stronger teachers yield better students. In this work, we examine this assumption about distillation in large language model pretraining. By varying arch…

arXiv cs.LG TIER_1 · Lukas Thede, Stefan Winzeck, Zeynep Akata, Jonathan Richard Schwarz · 2026-05-25 04:00

CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training

arXiv:2603.06610v2 Announce Type: replace Abstract: Large language model (LLM) post-training enhances latent skills, unlocks value alignment, improves performance, and enables domain adaptation. Unfortunately, post-training is known to induce forgetting, especially in the ubiquit…

arXiv cs.CL TIER_1 · Zhuang Liu · 2026-05-22 17:16

Strong Teacher Not Needed? On Distillation in LLM Pretraining

Knowledge distillation generally assumes a strong-to-weak relationship where stronger teachers yield better students. In this work, we examine this assumption about distillation in large language model pretraining. By varying architecture sizes and training token budgets, we crea…

arXiv cs.AI TIER_1 · Brandon Cui, Ximing Lu, Jaehun Jung, Syeda Nahida Akter, Hyunwoo Kim, Yuxiao Qu, David Acuna, Shrimai Prabhumoye, Yejin Choi, Prithviraj Ammanabrolu · 2026-05-22 04:00

Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages

arXiv:2605.20285v1 Announce Type: cross Abstract: We tackle the question of how to scale more efficiently across the many, ever-growing stages of current LLM training pipelines. Our guiding intuition stems from the fact that the dynamics of later stages of the pipeline, e.g. post…

Hugging Face Daily Papers TIER_1 · 2026-05-18 03:09

Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates

It is widely believed that stochastic gradient descent (SGD) performs significantly worse than adaptive optimizers such as Adam in pre-training Large Language Models (LLMs). Yet the underlying reason for this gap remains unclear. In this work, we attribute a large part of the dis…

