GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems

GEM framework optimizes GPU mapping for MoE AI models to speed up inference

By the PulseAugur editorial team · [1 source] · 2026-05-19 15:01

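Researchers have developed GEM, a framework for optimizing the expert-to-GPU mapping in Mixture-of-Experts (MoE) AI models. The approach accounts for variability in GPU performance and aims to reduce inference latency by placing experts strategically. GEM assigns experts so that the GPUs finish processing each layer at the same time, which mitigates slowdowns caused by slower GPUs or overloaded experts. Experiments show that GEM improves end-to-end latency by 7.9% on average, with gains of up to 16.5% in some cases.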
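The placement idea in the summary can be illustrated with a small sketch. This is not GEM's actual algorithm, which the summary does not describe in detail; it is a generic greedy heuristic (heaviest expert first, placed on the GPU that would finish earliest) written under several assumptions: `expert_loads` and `gpu_speeds` are hypothetical inputs in arbitrary units, per-expert time is simply load divided by speed, and there is no per-GPU cap on the number of experts.

```python
def map_experts(expert_loads, gpu_speeds):
    """Greedy sketch: balance per-layer finish times across unequal GPUs.

    expert_loads: hypothetical per-expert work (e.g. expected tokens routed).
    gpu_speeds:   hypothetical relative throughput per GPU (higher = faster).
    Returns (mapping, finish), where mapping[expert] = gpu index and
    finish[gpu] = estimated time for that GPU to complete the layer.
    """
    n_gpus = len(gpu_speeds)
    finish = [0.0] * n_gpus
    mapping = {}
    # Place the heaviest experts first; they are the hardest to balance later.
    order = sorted(range(len(expert_loads)), key=lambda i: -expert_loads[i])
    for e in order:
        # Pick the GPU whose finish time is lowest after adding this expert.
        best = min(
            range(n_gpus),
            key=lambda g: finish[g] + expert_loads[e] / gpu_speeds[g],
        )
        mapping[e] = best
        finish[best] += expert_loads[e] / gpu_speeds[best]
    return mapping, finish


def round_robin_makespan(expert_loads, gpu_speeds):
    """Baseline that ignores GPU speed: expert i goes to GPU i % n_gpus."""
    finish = [0.0] * len(gpu_speeds)
    for i, load in enumerate(expert_loads):
        g = i % len(gpu_speeds)
        finish[g] += load / gpu_speeds[g]
    return max(finish)


# Four equally loaded experts, one GPU running at half the speed of the other.
mapping, finish = map_experts([4, 4, 4, 4], [1.0, 0.5])
print(mapping, finish, round_robin_makespan([4, 4, 4, 4], [1.0, 0.5]))
```

In this toy case the speed-aware placement gives the slower GPU a single expert, so the layer's slowest GPU finishes sooner than under a speed-blind round-robin split. A real serving engine would also have to respect memory limits and token-routing costs, which this sketch leaves out.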

Impact: Optimizes MoE model inference, potentially lowering latency and improving efficiency in large-scale AI deployments.

Ranking rationale: An academic paper presenting a novel optimization technique for AI systems.


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

arXiv cs.AI · Poulami Das · 2026-05-19 15:01

GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems

Mixture-of-Expert (MoE) models enable efficient inference by employing smaller experts and activating only a subset of them per token. MoE serving engines distribute experts across multiple GPUs and route tokens to appropriate GPUs at inference time based on experts activated. Th…

