GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs
Researchers have developed GEMQ, a mixed-precision quantization method designed specifically for Mixture-of-Experts (MoE) large language models. It addresses the large memory overhead of MoE models by assigning a bit-width to each expert according to that expert's importance. GEMQ estimates importance with a global linear-programming formulation and fine-tunes the router to adapt to the quantized experts, reducing memory usage and speeding up inference with minimal accuracy loss.
AI IMPACT: Reduces the memory footprint and accelerates inference for MoE LLMs, potentially enabling wider deployment of these models.
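
To make the idea of global, importance-based bit-width allocation concrete, here is a minimal toy sketch. It is not GEMQ's actual formulation: the summary does not give the objective or constraints, so the error proxy (importance times 2^-bits), the average-bit budget, the importance scores, and the function name allocate_bits are all assumptions made for illustration. The linear program is also swapped for an exhaustive search, which is only feasible for a handful of experts.

```python
from itertools import product


def allocate_bits(importance, bit_choices, avg_bit_budget):
    """Choose one bit-width per expert under a global average-bit budget.

    Hypothetical stand-in for a global allocation step: exhaustive search
    over all assignments, so it only works for toy sizes.
    """
    n = len(importance)
    max_total_bits = avg_bit_budget * n
    best_cost, best_assign = float("inf"), None
    for assign in product(bit_choices, repeat=n):
        if sum(assign) > max_total_bits:
            continue  # violates the global memory budget
        # Assumed error proxy: each extra bit halves the quantization
        # error, weighted by the expert's importance score.
        cost = sum(w * 2.0 ** (-b) for w, b in zip(importance, assign))
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_assign


# Made-up importance scores for four experts, with an average budget of 3 bits.
importance = [0.9, 0.1, 0.5, 0.2]
bits = allocate_bits(importance, bit_choices=(2, 3, 4), avg_bit_budget=3)
print(bits)
```

In this toy instance the two most important experts receive 4 bits and the two least important receive 2, so the total stays at the same 12 bits a uniform 3-bit scheme would use. A real system would need measured importance scores and a solver that scales to hundreds of experts.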