PulseAugur
New research optimizes Sparse Mixture-of-Experts for efficient LLM scaling

Researchers are exploring new methods to optimize Sparse Mixture-of-Experts (SMoE) models, which are crucial for scaling large language models efficiently. One paper reveals a geometric coupling between routers and experts, suggesting that matched directions accumulate similar routed token histories and that auxiliary load-balancing losses can disrupt this structure. Another study systematically analyzed over 2,000 pretraining runs to optimize design choices like expert count and granularity, finding that these factors have a greater impact than others such as shared experts or load-balancing mechanisms. A third paper introduces DECO, an SMoE architecture designed for end-side devices that matches dense Transformer performance with significantly fewer active parameters and offers hardware acceleration.
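For readers unfamiliar with the terms above, the following is a minimal sketch of top-k routing and a Switch-style auxiliary load-balancing loss. It is not taken from any of the papers listed here; the function names (`route_top_k`, `load_balancing_loss`) and the toy logits are illustrative assumptions, and the loss is the commonly used form `num_experts * sum_i f_i * P_i`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-probability experts for each token (hypothetical helper).

    router_logits: list of per-token lists, one logit per expert.
    Returns (probs, assignments); assignments[t] lists the chosen expert ids.
    """
    probs = [softmax(row) for row in router_logits]
    assignments = [
        sorted(range(len(p)), key=lambda e: p[e], reverse=True)[:k]
        for p in probs
    ]
    return probs, assignments


def load_balancing_loss(probs, assignments, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_i f_i * P_i.

    f_i = fraction of routed token slots sent to expert i,
    P_i = mean router probability assigned to expert i.
    """
    num_tokens = len(probs)
    slots = sum(len(a) for a in assignments)
    f = [0.0] * num_experts
    for chosen in assignments:
        for e in chosen:
            f[e] += 1.0 / slots
    p_mean = [sum(p[e] for p in probs) / num_tokens for e in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p_mean))


# Toy data (assumed): 8 tokens, 4 experts, top-1 routing.
num_experts = 4
balanced_logits = [
    [5.0 if e == t % num_experts else 0.0 for e in range(num_experts)]
    for t in range(8)
]
collapsed_logits = [
    [5.0 if e == 0 else 0.0 for e in range(num_experts)] for _ in range(8)
]

for name, logits in [("balanced", balanced_logits), ("collapsed", collapsed_logits)]:
    probs, assignments = route_top_k(logits, k=1)
    print(name, load_balancing_loss(probs, assignments, num_experts))
```

Under this formulation the loss is smallest when tokens are spread evenly across experts and grows toward `num_experts` as routing collapses onto a single expert, which is the pressure the summary describes as potentially interfering with expert specialization.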

Impact: New research explores architectural optimizations for Mixture-of-Experts models, potentially improving efficiency and performance for large language models.

Ranking rationale: Multiple academic papers published on arXiv detail new methods and analyses for Sparse Mixture-of-Experts architectures.

AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

  1. arXiv cs.CL TIER_1 English(EN) · Mor Geva

    Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

    Sparse Mixture-of-Experts (SMoE) models enable scaling language models efficiently, but training them remains challenging, as routing can collapse onto few experts and auxiliary load-balancing losses can reduce specialization. Motivated by these hurdles, we study how routing deci…

  2. arXiv cs.CL TIER_1 English(EN) · Luke Zettlemoyer

    Slicing and Dicing: Configuring Optimal Mixtures of Experts

    Mixture-of-Experts (MoE) architectures have become standard in large language models, yet many of their core design choices - expert count, granularity, shared experts, load balancing, token dropping - have only been studied one or two at a time over narrow configuration ranges. …

  3. arXiv cs.CL TIER_1 English(EN) · Zhiyuan Liu

    DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high perfor…

  4. arXiv stat.ML TIER_1 English(EN) · Alessandro Breccia

    How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

    Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsi…