Mixture-of-Parallelisms: Towards Memory-Efficient Training Stack for Mixture-of-Experts Models

New MoP training stack supports trillion-parameter MoE models and 1M-token context

By the PulseAugur editorial team · [1 source] · 2026-07-03 04:00

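Researchers have introduced Mixture-of-Parallelisms (MoP), a new training stack designed to improve the memory efficiency of Mixture-of-Experts (MoE) models. The approach combines existing and novel parallelism techniques and specializes them for different layers and stages of the MoE training pipeline. MoP optimizes for CPU memory, GPU memory, and communication bandwidth, which makes it possible to train trillion-parameter models with million-token context lengths on a relatively small cluster of 128 H200 GPUs. Experimental results show that MoP achieves significantly higher per-GPU throughput than standard baselines and supports longer context lengths.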
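To see why memory is the binding constraint at this scale, a rough budget helps. The sketch below is back-of-the-envelope arithmetic, not a calculation from the paper: the parameter count and cluster size come from the summary above, the 141 GB per-GPU figure is the H200's published HBM capacity, and the 16 bytes-per-parameter layout is an assumed mixed-precision Adam setup that MoP itself may not use.

```python
# Back-of-the-envelope memory budget. Constants marked "assumed" are
# illustrative and are NOT taken from the MoP paper.
PARAMS = 1_000_000_000_000   # one trillion parameters (from the summary)
NUM_GPUS = 128               # cluster size (from the summary)
HBM_PER_GPU_GB = 141         # H200 HBM capacity in decimal GB (vendor spec)

# Assumed mixed-precision Adam layout, in bytes per parameter.
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_grads": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}


def model_state_gb(params, layout):
    """Total weight, gradient, and optimizer state in decimal GB."""
    return params * sum(layout.values()) / 1e9


def cluster_hbm_gb(num_gpus, hbm_gb):
    """Aggregate GPU memory across the cluster in decimal GB."""
    return num_gpus * hbm_gb


state_gb = model_state_gb(PARAMS, BYTES_PER_PARAM)
hbm_gb = cluster_hbm_gb(NUM_GPUS, HBM_PER_GPU_GB)
# GPU memory left per device for activations, buffers, and fragmentation
# if every byte of model state were sharded perfectly across the cluster.
headroom_gb_per_gpu = (hbm_gb - state_gb) / NUM_GPUS

print(f"model state: {state_gb:.0f} GB")
print(f"cluster HBM: {hbm_gb} GB")
print(f"headroom per GPU: {headroom_gb_per_gpu:.1f} GB")
```

Under these assumptions the model state alone takes roughly 16 TB of the cluster's 18 TB of GPU memory, leaving about 16 GB per GPU for activations. That is a tight margin for million-token sequences, which is consistent with the summary's point that the stack has to budget CPU memory and communication bandwidth alongside GPU memory.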

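Impact: This new training stack could significantly lower the hardware requirements for training large MoE models, accelerating research and development in the field.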

Ranking rationale: This cluster contains an academic paper that details a new method for training AI models.


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

arXiv cs.AI · Xuan-Phi Nguyen, Shrey Pandit, Yiran Zhao, Semih Yavuz, Silvio Savarese, Shafiq Joty · 2026-07-03 04:00

Mixture-of-Parallelisms: Towards Memory-Efficient Training Stack for Mixture-of-Experts Models

arXiv:2607.01844v1 Announce Type: cross Abstract: This paper showcases a memory-efficient training stack for Mixture-of-Experts (MoE) models. It is a training paradigm that combines and specializes various existing and novel parallelism techniques at different layers and stages o…

