Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

LLM Pre-Training Research Explores Sparse vs. Dense and Low-Rank Methods

By the PulseAugur editorial team · [2 sources] · 2026-05-13 15:11

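Two new research papers examine methods for efficient pre-training of large language models. The first compares dense and sparse mixture-of-experts (MoE) Transformer architectures at small scale and finds that MoE models improve validation loss when matched on active parameters, but do not outperform dense models when total parameter capacity is equal. The second studies a range of low-rank pre-training techniques and shows that, even at similar validation perplexity, these methods converge to geometrically distinct solutions and do not fully reproduce the generalization or internal representations of full-rank training.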
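To make the distinction between active-parameter and total-parameter matching concrete, here is a minimal sketch of the bookkeeping for a single feed-forward layer. It assumes a LLaMA-style gated FFN with three bias-free projections and a Mixtral-style router that selects `top_k` of `n_experts` experts per token. The layer sizes are hypothetical and chosen only for illustration; they are not the configurations used in the paper.

```python
def dense_ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters in a gated FFN: gate, up, and down projections, no biases."""
    return 3 * d_model * d_ff


def moe_ffn_params(d_model: int, d_ff: int, n_experts: int, top_k: int) -> tuple[int, int]:
    """Return (total, active) parameter counts for one routed MoE FFN layer.

    total  : all experts plus the router (what must be stored)
    active : the top_k experts a token is routed to, plus the router
             (what is used in one token's forward pass)
    """
    expert = dense_ffn_params(d_model, d_ff)
    router = d_model * n_experts  # linear map from hidden state to expert logits
    total = n_experts * expert + router
    active = top_k * expert + router
    return total, active


# Hypothetical tiny-scale sizes (assumption, not taken from the paper).
d_model, d_ff, n_experts, top_k = 256, 512, 8, 2
total, active = moe_ffn_params(d_model, d_ff, n_experts, top_k)

# Width a dense FFN would need to match each budget at the same d_model.
d_ff_active_match = round(active / (3 * d_model))
d_ff_total_match = round(total / (3 * d_model))

print(f"MoE total params per layer:  {total}")
print(f"MoE active params per layer: {active}")
print(f"Dense d_ff for active match: {d_ff_active_match}")
print(f"Dense d_ff for total match:  {d_ff_total_match}")
```

Under these assumed sizes, the dense baseline matched on total parameters is roughly four times wider than the one matched on active parameters, which is why the two comparisons can lead to different conclusions.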

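Impact: These studies offer insight into optimizing LLM training efficiency and into the trade-offs between different architectures and optimization methods.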

Ranking rationale: Two academic papers published on arXiv that detail new research on LLM pre-training methodology.


AI-generated summary · Google Gemini · based on 2 sources.

Sources [2]

arXiv cs.CL TIER_1 English(EN) · Abdalrahman Wael · 2026-05-13 16:48

Dense vs. Sparse Pretraining at Tiny Scale: Active-Parameter vs. Total-Parameter Matching

We study dense and mixture-of-experts (MoE) transformers in a tiny-scale pretraining regime under a shared LLaMA-style decoder training recipe. The sparse model replaces dense feed-forward blocks with Mixtral-style routed experts. Dense baselines are modestly width-resized to tig…
arXiv cs.AI TIER_1 English(EN) · Anna Rumshisky · 2026-05-13 15:11

Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

Pre-training large language models is dominated by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training has emerged to address this, and the space of methods has grown rapidly. A central question remains open: do low-rank methods pr…

