Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

Study finds that optimizer choice dramatically changes Transformer scaling laws

By the PulseAugur editorial team · [2 sources] · 2026-05-20 00:00

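A new research paper shows that the choice of optimizer significantly affects a Transformer model's capacity and scaling laws, even when the architecture is held fixed. The study finds that the Muon optimizer achieves linear scaling in representation capacity, a 2.3x improvement over the weaker scaling of AdamW, especially in the challenging rare-token regime. This suggests that the optimizer should be treated as a primary factor in model scaling alongside architecture and data, and it highlights the potential of co-designing optimizers and architectures for better performance.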

Impact: Highlights optimizer choice as a key and under-explored factor in achieving optimal model scaling and representation capacity.

Ranking rationale: This cluster contains an academic paper detailing new research findings on model training.


AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.LG · Nandan Kumar Jha, Brandon Reagen · 2026-05-22 04:00

Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

arXiv:2605.21803v1. Abstract: Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of represent…

Hugging Face Daily Papers · 2026-05-20 00:00

Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

Different optimizers produce distinct spectral scaling behaviors in Transformer models, with Muon achieving superior scaling efficiency compared to AdamW in representation capacity utilization.
