DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

New research optimizes sparse Mixture-of-Experts models for efficient LLM scaling

By the PulseAugur editorial team · [4 sources] · 2026-05-11 17:58

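Researchers are exploring new ways to optimize sparse Mixture-of-Experts (SMoE) models, which are central to scaling large language models efficiently. One paper identifies a geometric coupling between routers and experts: matching directions accumulate similar histories of routed tokens, and auxiliary load-balancing losses disrupt this structure. A second study systematically analyzes more than 2,000 pretraining runs to tune design choices such as expert count and granularity, and finds that these matter more than factors such as shared experts or load-balancing mechanisms. A third paper introduces DECO, an SMoE architecture designed for end-side devices that matches dense Transformer performance with significantly fewer activated parameters and provides hardware acceleration.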

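Impact: The new research explores architectural optimizations for Mixture-of-Experts models, which could improve the efficiency and performance of large language models.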

Ranking rationale: Several academic papers published on arXiv detail new methods and analyses for sparse Mixture-of-Experts architectures.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

arXiv cs.CL TIER_1 English(EN) · Mor Geva · 2026-05-12 17:55

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

Sparse Mixture-of-Experts (SMoE) models enable scaling language models efficiently, but training them remains challenging, as routing can collapse onto few experts and auxiliary load-balancing losses can reduce specialization. Motivated by these hurdles, we study how routing deci…
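
To make the terms in this abstract concrete, here is a minimal sketch of top-k routing together with one widely used auxiliary load-balancing loss (the Switch Transformer-style product of assignment fractions and mean router probabilities). It is an illustration only and is not taken from the paper; the function names and the toy logits are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_experts(probs, k):
    """Indices of the k experts with the highest router probability."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def load_balance_loss(router_logits, k=1):
    """Switch-style auxiliary loss: n_experts * sum_i f_i * P_i.

    f_i is the fraction of routing slots assigned to expert i and
    P_i is the mean router probability of expert i over the batch.
    The value is about 1.0 when load is even and grows as routing collapses.
    """
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    counts = [0] * n_experts
    mean_prob = [0.0] * n_experts
    for logits in router_logits:
        probs = softmax(logits)
        for i in top_k_experts(probs, k):
            counts[i] += 1
        for i in range(n_experts):
            mean_prob[i] += probs[i] / n_tokens
    frac = [c / (n_tokens * k) for c in counts]
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))

# Toy batch (hypothetical): 4 tokens, 4 experts.
balanced = [[5, 0, 0, 0], [0, 5, 0, 0], [0, 0, 5, 0], [0, 0, 0, 5]]
collapsed = [[5, 0, 0, 0]] * 4  # every token prefers expert 0

balanced_loss = load_balance_loss(balanced)
collapsed_loss = load_balance_loss(collapsed)
```

In this toy batch the collapsed routing yields a loss several times larger than the balanced routing, which is the pressure the auxiliary term applies during training.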
arXiv cs.CL TIER_1 English(EN) · Luke Zettlemoyer · 2026-05-12 07:47

Slicing and Dicing: Configuring Optimal Mixtures of Experts

Mixture-of-Experts (MoE) architectures have become standard in large language models, yet many of their core design choices - expert count, granularity, shared experts, load balancing, token dropping - have only been studied one or two at a time over narrow configuration ranges. …
arXiv cs.CL TIER_1 English(EN) · Zhiyuan Liu · 2026-05-11 17:58

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high perfor…
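
The gap between total and activated parameters that this abstract describes can be shown with a rough back-of-the-envelope count for a single MoE feed-forward layer. This is a sketch under simplifying assumptions (two weight matrices per expert, no biases, a linear router); the dimensions are hypothetical and do not describe DECO's actual configuration.

```python
def moe_ffn_params(d_model, d_expert, n_experts, top_k, n_shared=0):
    """Return (total, active) parameter counts for one MoE FFN layer.

    Assumes each expert is an up-projection plus a down-projection
    (2 * d_model * d_expert weights, biases ignored) and that the
    router is a single d_model x n_experts matrix.
    """
    per_expert = 2 * d_model * d_expert
    router = d_model * n_experts
    total = (n_experts + n_shared) * per_expert + router
    active = (top_k + n_shared) * per_expert + router  # used per token
    return total, active

# Hypothetical layer: 64 routed experts, 4 active per token.
total, active = moe_ffn_params(d_model=1024, d_expert=512, n_experts=64, top_k=4)
ratio = active / total
```

All `total` parameters must be stored and fetched from memory, while only `active` of them take part in the computation for a given token. That storage and memory-access cost is the bottleneck the abstract points to for end-side deployment.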
arXiv stat.ML TIER_1 English(EN) · Alessandro Breccia · 2026-05-13 23:32

How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsi…
