PulseAugur

New MoE inference design uses pooled HBM to cut communication latency on Ascend

Researchers have developed a new communication design for Mixture-of-Experts (MoE) inference on Ascend systems that targets the token-exchange bottleneck. The approach eliminates intermediate relay and reordering buffers by placing data directly into destination expert windows and reading directly from remote ones. It relies on globally pooled high-bandwidth memory (HBM) and symmetric memory allocation, and reports improved time to first token and competitive time per output token for MoE workloads.
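The difference between relay-buffer dispatch and direct placement can be illustrated with a toy, single-process sketch. This is not the paper's implementation: the function names, the fixed per-expert `capacity`, and the use of plain Python lists to stand in for pooled HBM windows are all assumptions made for illustration, and reading results back by `(expert, slot)` is only one plausible reading of "reading from remote ones".

```python
def dispatch_with_relay(tokens, expert_ids, num_experts):
    """Baseline: stage tokens in a relay buffer, then reorder them by expert."""
    relay = list(zip(expert_ids, tokens))      # intermediate relay copy
    relay.sort(key=lambda pair: pair[0])       # separate reordering pass (stable sort)
    windows = [[] for _ in range(num_experts)]
    for expert, token in relay:
        windows[expert].append(token)
    return windows


def dispatch_direct(tokens, expert_ids, num_experts, capacity):
    """Direct placement: write each token straight into its expert's window.

    Every expert gets a window of the same size (a stand-in for symmetric
    allocation), so a destination is fully described by (expert, slot).
    """
    windows = [[None] * capacity for _ in range(num_experts)]
    fill = [0] * num_experts                   # next free slot per expert
    placement = []                             # where each token landed
    for token, expert in zip(tokens, expert_ids):
        slot = fill[expert]
        if slot >= capacity:
            raise ValueError(f"window for expert {expert} is full")
        windows[expert][slot] = token          # single write, no relay, no reorder
        fill[expert] += 1
        placement.append((expert, slot))
    return windows, placement


def combine_direct(windows, placement):
    """Read each token's result back from the window it was placed in."""
    return [windows[expert][slot] for expert, slot in placement]


tokens = ["t0", "t1", "t2", "t3", "t4"]
expert_ids = [2, 0, 2, 1, 0]                   # hypothetical top-1 routing decisions
windows, placement = dispatch_direct(tokens, expert_ids, num_experts=3, capacity=4)
restored = combine_direct(windows, placement)  # tokens back in their original order
```

Both dispatch functions produce the same per-expert grouping; the direct version gets there without the staging copy or the sort, which is the kind of saving the summary describes. Real systems would also have to handle multi-device addressing, synchronization, and top-k routing, none of which this sketch models.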

Impact: This research could lead to more efficient inference for large MoE models on specific hardware platforms.

Ranking rationale: This is a research paper detailing a novel technical approach for optimizing MoE inference.

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 1 source.


Sources [1]

  1. arXiv cs.LG TIER_1 English (EN) · Tianlun Hu, Tiancheng Hu, Shengsheng Litang, Sheng Wang, Xiaoming Bao, Yuxing Li, Wei Wang, Zhongzhe Hu, Lijun Li, Hongwei Sun, Jingbin Zhou

    Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend

    arXiv:2605.06055v1 (announce type: cross). Abstract: Mixture-of-Experts (MoE) inference requires large-scale token exchange across devices, making dispatch and combine major bottlenecks in both prefill and decode. Beyond network transfer, routing-driven layout transformation, tempor…