PulseAugur
New GEMQ method optimizes memory and speed for MoE LLMs

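Researchers have developed GEMQ, a new mixed-precision quantization method designed for Mixture-of-Experts (MoE) large language models (LLMs). The method addresses the substantial memory overhead of MoE models by allocating a bit-width to each expert according to its importance. GEMQ estimates importance with a global linear-programming approach and includes router fine-tuning to adapt to the quantized experts, which reduces memory use and speeds up inference with minimal accuracy loss.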
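To make the idea of importance-based bit allocation concrete, here is a minimal Python sketch. It is not GEMQ's algorithm: the summary does not give the paper's linear-programming formulation, so this uses a plain greedy heuristic as a stand-in. The function name `allocate_bits`, the importance scores, the error proxy `importance * 2**(-bits)`, and the assumption that all experts have the same parameter count are hypothetical and chosen only for illustration.

```python
def allocate_bits(importance, bit_choices=(2, 3, 4), avg_bits=3.0):
    """Greedy stand-in for expert-wise bit allocation (NOT the paper's LP).

    importance  : hypothetical per-expert importance scores
    bit_choices : candidate bit-widths an expert may be quantized to
    avg_bits    : target average bit-width, assuming equal-sized experts
    """
    choices = sorted(bit_choices)
    n = len(importance)
    budget = avg_bits * n              # total bits available across experts
    level = [0] * n                    # index into `choices` for each expert
    used = choices[0] * n              # everyone starts at the lowest width
    if used > budget:
        raise ValueError("budget is below the lowest bit-width")

    def error(i, bits):
        # Assumed proxy: quantization error halves with each extra bit,
        # scaled by how important the expert is.
        return importance[i] * 2.0 ** (-bits)

    while True:
        best, best_gain = None, 0.0
        for i in range(n):
            if level[i] + 1 >= len(choices):
                continue               # already at the highest width
            cur, nxt = choices[level[i]], choices[level[i] + 1]
            if used + (nxt - cur) > budget:
                continue               # upgrade would exceed the budget
            gain = (error(i, cur) - error(i, nxt)) / (nxt - cur)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break                      # no affordable, useful upgrade left
        used += choices[level[best] + 1] - choices[level[best]]
        level[best] += 1
    return [choices[l] for l in level]


# Four experts, average budget of 3 bits per expert.
print(allocate_bits([0.5, 0.1, 0.3, 0.1]))  # prints [4, 2, 4, 2]
```

In this toy run the two most important experts end up at 4 bits and the two least important at 2 bits, while the average stays at 3 bits. A global formulation such as the one the summary attributes to GEMQ would solve for all assignments jointly rather than one upgrade at a time.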

Impact: Reduces the memory footprint of MoE LLMs and speeds up inference, which could allow these models to be deployed and used more widely.

Ranking rationale: An academic paper presenting a new method for optimizing LLMs was published.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 3 sources.

Sources [3]

  1. arXiv cs.AI TIER_1 English(EN) · Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, KhayTze Peong, Kang Eun Jeon, Jong Hwan Ko, Yiran Chen, Huanrui Yang ·

    MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Any-Precision LLM

    arXiv:2602.20191v2 Announce Type: replace-cross Abstract: Dynamic runtime latency and memory constraints necessitate flexible large language model (LLM) deployment, where an LLM can be inferred with various quantization precisions based on available computational resources. Recen…

  2. arXiv cs.CL TIER_1 English(EN) · Jianing Deng, Song Wang, Dongwei Wang, Zijie Liu, Tianlong Chen, Huanrui Yang, Jingtong Hu ·

    GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

    arXiv:2605.23078v1 Announce Type: cross Abstract: Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed-precision quantization mitigates this cost by allocating expert-wise bit-…

  3. arXiv cs.CL TIER_1 English(EN) · Jingtong Hu ·

    GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

    Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed-precision quantization mitigates this cost by allocating expert-wise bit-widths based on their importance, approaching the …