Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains

JetBrains Releases Efficient Mellum2 MoE Model; Research Advances MoE Techniques

By the PulseAugur editorial team · [15 sources] · 2026-05-30 00:00

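JetBrains has released Mellum2, an open-source 12-billion-parameter Mixture-of-Experts (MoE) model optimized for efficient inference on text and code tasks. The model activates only a small fraction of its parameters for each token, which enables faster, lower-latency operation and suits it to routing, RAG pipelines, and sub-agent tasks within larger AI systems. Several research papers also explore advances in MoE architectures, including efficient serving techniques such as CRAFT, novel aggregation methods such as DAG-MoE, adaptive gating with Kappa-SwiGLU, probabilistic routing with ProbMoE, and game-theory-inspired expert merging strategies.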
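To illustrate what "activating only a small fraction of parameters per token" means, here is a minimal sketch of generic top-k MoE routing in plain Python. It is not Mellum2's actual router: the expert count, the value of k, the router scores, and the toy scalar experts are all hypothetical values chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by router weight."""
    routes = top_k_route(router_logits, k)
    output = sum(weight * experts[idx](x) for idx, weight in routes)
    return output, [idx for idx, _ in routes]


# Hypothetical setup: 8 toy experts, 2 active per token.
NUM_EXPERTS = 8
TOP_K = 2
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]

# Assumed router scores for one token (a real router computes these
# from the token's hidden state).
router_logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

y, used = moe_forward(1.0, experts, router_logits, TOP_K)
print("experts used:", used)                     # experts 4 and 1
print("fraction active:", len(used) / NUM_EXPERTS)
```

Only two of the eight toy experts run for this token, so the per-token compute scales with k rather than with the total number of experts. That is the general property the summary attributes to MoE models.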

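Impact: Mellum2's efficiency-focused, specialized design offers a faster, cheaper alternative for specific tasks within larger AI systems, and could accelerate adoption of modular AI architectures.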

Ranking rationale: JetBrains released a new model, Mellum2, a 12B-parameter Mixture-of-Experts model.


AI-generated summary · Google Gemini · from 15 sources.


Sources [15]

Hugging Face Blog TIER_1 English(EN) · 2026-06-01 15:45

Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains
arXiv cs.LG TIER_1 English(EN) · Simon Schug · 2026-06-08 04:00

Sparsely Gated Tiny Linear Experts

arXiv:2606.07414v1 Announce Type: new Abstract: Sparsity allows scaling model parameters without proportionally increasing computational cost. While mixture of experts (MoE) models are made increasingly sparse, individual experts typically remain large and dense. Here, we demonst…
arXiv cs.LG TIER_1 English(EN) · Simon Schug · 2026-06-05 16:06

Sparsely Gated Tiny Linear Experts

Sparsity allows scaling model parameters without proportionally increasing computational cost. While mixture of experts (MoE) models are made increasingly sparse, individual experts typically remain large and dense. Here, we demonstrate that further increasing sparsity by shrinki…
arXiv cs.LG TIER_1 English(EN) · Duc Anh Nguyen, Huu Binh Ta, Nhuan Le Duc, Tan Minh Nguyen, Toan Tran · 2026-06-05 04:00

Selective Sinkhorn Routing for Improved Sparse Mixture-of-Experts Models

arXiv:2511.08972v2 Announce Type: replace Abstract: Sparse Mixture-of-Experts (SMoE) models are scalable and computationally efficient, enabling large increases in model capacity with limited inference overhead. Existing SMoE methods often depend on auxiliary objectives, such as …
arXiv cs.CL TIER_1 English(EN) · Hancheol Park, Geonho Lee, Tairen Piao, Tae-Ho Kim · 2026-06-05 04:00

Value and Structure Alignment in Routing-Consistent Quantization for Mixture-of-Experts Models

arXiv:2606.05688v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment. Unli…
arXiv cs.CL TIER_1 English(EN) · Jie Cao, Zhenxuan Fan, Zhuonan Wang, Tianwei Lin, Ziyuan Zhao, Rolan Yan, Wenqiao Zhang, Feifei Shao, Hongwei Wang, Jun Xiao, Siliang Tang · 2026-06-05 04:00

CoMoL: Efficient Mixture of LoRA Experts via Dynamic Core Space Merging

arXiv:2603.00573v2 Announce Type: replace Abstract: Large language models (LLMs) achieve remarkable performance on diverse downstream and domain-specific tasks via parameter-efficient fine-tuning (PEFT). However, existing PEFT methods, particularly MoE-LoRA architectures, suffer …
arXiv cs.CL TIER_1 English(EN) · Shaohua Li, Xiuchao Sui, Xiaobing Sun, Yuhang Wu, Liangli Zhen, Yong Liu, Rick Siow Mong Goh · 2026-06-02 04:00

Confidence-Adaptive SwiGLU for Mixture of Experts

arXiv:2606.00761v1 Announce Type: cross Abstract: SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness -- the smoothness and selectivity of the gating function -- is typically fixed throughout training. In this work, we propose Confiden…
arXiv cs.AI TIER_1 English(EN) · Jiarui Feng, Hanqing Zeng, Karish Grover, Ruizhong Qiu, Yinglong Xia, Qiang Zhang, Qifan Wang, Ren Chen, Dongqi Fu, Jiayi Liu, Zhoukai Zhao, Xiangjun Fan, Benyu Zhang, Yixin Chen · 2026-06-02 04:00

DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture of Experts

arXiv:2606.01062v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) models have become a leading approach for decoupling parameter count from computational cost in large language models, yet effectively scaling MoE performance remains a challenge. Prior work shows that fine-…
arXiv cs.AI TIER_1 English(EN) · Heng Zhao, Zilei Shao, Guy Van den Broeck, Zhe Zeng · 2026-06-02 04:00

ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts Models

arXiv:2606.01509v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) models scale by activating only a small subset of experts per token. However, training such models remains challenging because top-$k$ routing is discrete and non-differentiable, requiring gradient estimat…
arXiv cs.AI TIER_1 English(EN) · Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, Mohit Bansal · 2026-06-02 04:00

Skill-Based Mixture of Experts: Adaptive Routing for Heterogeneous Reasoning via Reasoning Skills

arXiv:2503.05641v4 Announce Type: replace-cross Abstract: Combining existing pre-trained LLMs is a promising approach for diverse reasoning tasks. However, task-level expert selection is often too coarse-grained, since different instances may require different expertise. To addre…
arXiv cs.AI TIER_1 English(EN) · Shuo Wang, Shunyang Huang, Jinghui Yuan, Zhixiang Shen, Zhao Kang · 2026-06-02 04:00

Cooperation of Experts: Fusing Heterogeneous Information with Large Margin

arXiv:2505.20853v3 Announce Type: replace-cross Abstract: Fusing heterogeneous information remains a persistent challenge in modern data analysis. While significant progress has been made, existing approaches often fail to account for the inherent heterogeneity of object patterns…
arXiv cs.LG TIER_1 English(EN) · Adrian Zhao, Zhenkun Cai, Zhenyu Song, Lingfan Yu, Haozheng Fan, Jun Wu, Yida Wang, Nandita Vijaykumar · 2026-06-02 04:00

CRAFT: Fine-Grained Cost-Aware Expert Replication for Efficient Mixture-of-Experts Serving

arXiv:2603.28768v2 Announce Type: replace-cross Abstract: Mixture-of-Experts (MoE) has recently emerged as the mainstream architecture for efficiently scaling large language models while maintaining near-constant computational cost. Expert parallelism distributes parameters by pa…
arXiv cs.CL TIER_1 English(EN) · Giang Do, Hung Le, Truyen Tran · 2026-06-01 04:00

Rethinking Sparse Mixture-of-Experts Models from a Unified Perspective

arXiv:2503.22996v3 Announce Type: replace Abstract: Sparse Mixture of Experts (SMoE) models scale the capacity of models while maintaining constant computational overhead. SMoE methods fall into two categories: Token Choice, which routes each token to a fixed number of experts, a…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-30 00:00

Confidence-Adaptive SwiGLU for Mixture of Experts

Confidence-Aware SwiGLU adjusts expert gate sharpness in Mixture-of-Experts models based on token-level routing confidence, improving performance with minimal computational overhead.
arXiv stat.ML TIER_1 English(EN) · Dung V. Nguyen, Anh T. Nguyen, Minh H. Nguyen, Luc Q. Nguyen, Shiqi Jiang, Ethan Fetaya, Linh Duy Tran, Gal Chechik, Tan M. Nguyen · 2026-06-01 04:00

Nash Bargaining Among Experts in Sparse Mixture-of-Experts Models

arXiv:2510.16138v2 Announce Type: replace-cross Abstract: Existing expert merging strategies for Sparse Mixture of Experts (SMoE) typically rely on input-dependent or input-independent averaging of expert parameters, but often lack a principled weighting mechanism. In this work, …
