JetBrains releases efficient Mellum2 MoE model; research advances MoE techniques

By PulseAugur Editorial · [15 sources] · 2026-05-30 00:00

JetBrains has released Mellum2, an open-source 12-billion-parameter Mixture-of-Experts (MoE) model optimized for efficient inference on text and code tasks. The model activates only a fraction of its parameters per token, which enables faster, lower-latency operation suited to routing, RAG pipelines, and sub-agent tasks within larger AI systems. Several research papers also explore advances in MoE architectures: efficient serving with CRAFT, structural aggregation with DAG-MoE, adaptive gating with Kappa-SwiGLU, probabilistic routing with ProbMoE, and game-theory-inspired expert merging.

IMPACT Mellum2's efficiency and specialized design offer a faster, cheaper alternative for specific tasks within larger AI systems, potentially accelerating the adoption of modular AI architectures.

RANK_REASON JetBrains released a new model, Mellum2, a 12B-parameter Mixture-of-Experts model.
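
As a rough illustration of the sparse activation described in the summary above, the following sketch shows generic top-k expert routing for a single token. It is not Mellum2's implementation: the expert count (8), the value of k (2), the toy "experts" (simple scalings), and all function names are assumptions chosen only to show that each token runs a small subset of the experts.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, router, experts, k):
    """Run one token through a toy MoE layer.

    x: token vector; router: one weight row per expert; experts: callables.
    Only the k selected experts are evaluated; the rest are skipped.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    routes = top_k_route(logits, k)
    out = [0.0] * len(x)
    for idx, weight in routes:
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, [idx for idx, _ in routes]


def make_expert(scale):
    """Hypothetical stand-in for an expert MLP: scales the input."""
    return lambda x: [scale * xi for xi in x]


# Assumed toy sizes, not taken from the Mellum2 release.
NUM_EXPERTS, TOP_K, DIM = 8, 2, 4

random.seed(0)
router = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [make_expert(float(i + 1)) for i in range(NUM_EXPERTS)]
token = [0.5, -1.0, 0.25, 2.0]

output, used = moe_forward(token, router, experts, TOP_K)
print(f"experts used: {used} ({len(used)} of {NUM_EXPERTS})")
```

In this toy setup the per-token compute scales with k rather than with the total number of experts, which is the general reason sparse MoE models can hold many parameters while keeping inference cost low.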

AI-generated summary · Google Gemini · from 15 sources.

COVERAGE [15]

Hugging Face Blog TIER_1 English(EN) · 2026-06-01 15:45

Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains
arXiv cs.LG TIER_1 English(EN) · Simon Schug · 2026-06-08 04:00

Sparsely gated tiny linear experts

arXiv:2606.07414v1 Announce Type: new Abstract: Sparsity allows scaling model parameters without proportionally increasing computational cost. While mixture of experts (MoE) models are made increasingly sparse, individual experts typically remain large and dense. Here, we demonst…
arXiv cs.LG TIER_1 English(EN) · Simon Schug · 2026-06-05 16:06

Sparsely gated tiny linear experts

Sparsity allows scaling model parameters without proportionally increasing computational cost. While mixture of experts (MoE) models are made increasingly sparse, individual experts typically remain large and dense. Here, we demonstrate that further increasing sparsity by shrinki…
arXiv cs.LG TIER_1 English(EN) · Duc Anh Nguyen, Huu Binh Ta, Nhuan Le Duc, Tan Minh Nguyen, Toan Tran · 2026-06-05 04:00

Selective Sinkhorn Routing for Improved Sparse Mixture of Experts

arXiv:2511.08972v2 Announce Type: replace Abstract: Sparse Mixture-of-Experts (SMoE) models are scalable and computationally efficient, enabling large increases in model capacity with limited inference overhead. Existing SMoE methods often depend on auxiliary objectives, such as …
arXiv cs.CL TIER_1 English(EN) · Hancheol Park, Geonho Lee, Tairen Piao, Tae-Ho Kim · 2026-06-05 04:00

Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models

arXiv:2606.05688v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment. Unli…
arXiv cs.CL TIER_1 English(EN) · Jie Cao, Zhenxuan Fan, Zhuonan Wang, Tianwei Lin, Ziyuan Zhao, Rolan Yan, Wenqiao Zhang, Feifei Shao, Hongwei Wang, Jun Xiao, Siliang Tang · 2026-06-05 04:00

CoMoL: Efficient Mixture of LoRA Experts via Dynamic Core Space Merging

arXiv:2603.00573v2 Announce Type: replace Abstract: Large language models (LLMs) achieve remarkable performance on diverse downstream and domain-specific tasks via parameter-efficient fine-tuning (PEFT). However, existing PEFT methods, particularly MoE-LoRA architectures, suffer …
arXiv cs.CL TIER_1 English(EN) · Shaohua Li, Xiuchao Sui, Xiaobing Sun, Yuhang Wu, Liangli Zhen, Yong Liu, Rick Siow Mong Goh · 2026-06-02 04:00

Confidence-Adaptive SwiGLU for Mixture-of-Experts

arXiv:2606.00761v1 Announce Type: cross Abstract: SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness -- the smoothness and selectivity of the gating function -- is typically fixed throughout training. In this work, we propose Confiden…
arXiv cs.AI TIER_1 English(EN) · Jiarui Feng, Hanqing Zeng, Karish Grover, Ruizhong Qiu, Yinglong Xia, Qiang Zhang, Qifan Wang, Ren Chen, Dongqi Fu, Jiayi Liu, Zhoukai Zhao, Xiangjun Fan, Benyu Zhang, Yixin Chen · 2026-06-02 04:00

DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts

arXiv:2606.01062v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) models have become a leading approach for decoupling parameter count from computational cost in large language models, yet effectively scaling MoE performance remains a challenge. Prior work shows that fine-…
arXiv cs.AI TIER_1 English(EN) · Heng Zhao, Zilei Shao, Guy Van den Broeck, Zhe Zeng · 2026-06-02 04:00

ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts

arXiv:2606.01509v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) models scale by activating only a small subset of experts per token. However, training such models remains challenging because top-$k$ routing is discrete and non-differentiable, requiring gradient estimat…
arXiv cs.AI TIER_1 English(EN) · Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, Mohit Bansal · 2026-06-02 04:00

Skill-Based Mixture-of-Experts: Adaptive Routing for Heterogeneous Reasoning via Inferred Skills

arXiv:2503.05641v4 Announce Type: replace-cross Abstract: Combining existing pre-trained LLMs is a promising approach for diverse reasoning tasks. However, task-level expert selection is often too coarse-grained, since different instances may require different expertise. To addre…
arXiv cs.AI TIER_1 English(EN) · Shuo Wang, Shunyang Huang, Jinghui Yuan, Zhixiang Shen, Zhao Kang · 2026-06-02 04:00

Cooperation of Experts: Fusing Heterogeneous Information with Large Margin

arXiv:2505.20853v3 Announce Type: replace-cross Abstract: Fusing heterogeneous information remains a persistent challenge in modern data analysis. While significant progress has been made, existing approaches often fail to account for the inherent heterogeneity of object patterns…
arXiv cs.LG TIER_1 English(EN) · Adrian Zhao, Zhenkun Cai, Zhenyu Song, Lingfan Yu, Haozheng Fan, Jun Wu, Yida Wang, Nandita Vijaykumar · 2026-06-02 04:00

CRAFT: Fine-Grained Cost-Aware Expert Replication For Efficient Mixture-of-Experts Serving

arXiv:2603.28768v2 Announce Type: replace-cross Abstract: Mixture-of-Experts (MoE) has recently emerged as the mainstream architecture for efficiently scaling large language models while maintaining near-constant computational cost. Expert parallelism distributes parameters by pa…
arXiv cs.CL TIER_1 English(EN) · Giang Do, Hung Le, Truyen Tran · 2026-06-01 04:00

Rethinking Sparse Mixture of Experts from a Unified Perspective

arXiv:2503.22996v3 Announce Type: replace Abstract: Sparse Mixture of Experts (SMoE) models scale the capacity of models while maintaining constant computational overhead. SMoE methods fall into two categories: Token Choice, which routes each token to a fixed number of experts, a…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-30 00:00

Confidence-Adaptive SwiGLU for Mixture-of-Experts

Confidence-Adaptive SwiGLU adjusts expert gate sharpness in Mixture-of-Experts models based on token-level routing confidence, improving performance with minimal computational overhead.
arXiv stat.ML TIER_1 English(EN) · Dung V. Nguyen, Anh T. Nguyen, Minh H. Nguyen, Luc Q. Nguyen, Shiqi Jiang, Ethan Fetaya, Linh Duy Tran, Gal Chechik, Tan M. Nguyen · 2026-06-01 04:00

Expert Merging in Sparse Mixture of Experts with Nash Bargaining

arXiv:2510.16138v2 Announce Type: replace-cross Abstract: Existing expert merging strategies for Sparse Mixture of Experts (SMoE) typically rely on input-dependent or input-independent averaging of expert parameters, but often lack a principled weighting mechanism. In this work, …
