Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 6d

GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems

Researchers have developed GEM, a framework that optimizes how experts are mapped to GPUs in Mixture-of-Experts (MoE) models. The approach accounts for performance variability across GPUs and places experts so that all GPUs finish processing each layer at roughly the same time, which limits the slowdowns caused by slower GPUs or heavily loaded experts. In the reported experiments, GEM reduces end-to-end inference latency by 7.9% on average and by up to 16.5% in some cases.
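To make the placement idea concrete, here is a minimal sketch of a variability-aware mapping. It is not GEM's actual algorithm, which this brief does not describe in detail; it is a generic greedy heuristic (heaviest expert first, placed on whichever GPU would finish earliest). The function name `map_experts`, the per-expert load values, and the relative GPU speeds are all hypothetical and chosen only for illustration.

```python
def map_experts(expert_loads, gpu_speeds):
    """Greedy sketch: assign experts to GPUs so finish times stay balanced.

    expert_loads: dict of expert name -> work units per layer (hypothetical).
    gpu_speeds:   list of relative GPU throughputs; 1.0 is nominal, and a
                  lower value models a slower GPU (hypothetical).
    Returns (placement, finish): experts per GPU and estimated finish times.
    """
    finish = [0.0] * len(gpu_speeds)
    placement = {g: [] for g in range(len(gpu_speeds))}

    # Place the heaviest experts first so the lighter ones can fill gaps.
    for expert in sorted(expert_loads, key=expert_loads.get, reverse=True):
        load = expert_loads[expert]
        # Pick the GPU whose finish time is lowest after adding this expert,
        # taking that GPU's speed into account.
        best = min(
            range(len(gpu_speeds)),
            key=lambda g: finish[g] + load / gpu_speeds[g],
        )
        finish[best] += load / gpu_speeds[best]
        placement[best].append(expert)

    return placement, finish


if __name__ == "__main__":
    # Illustrative numbers only: the second GPU runs at half speed.
    loads = {"e0": 8, "e1": 4, "e2": 4, "e3": 2, "e4": 2}
    placement, finish = map_experts(loads, gpu_speeds=[1.0, 0.5])
    print(placement)
    print("layer latency estimate:", max(finish))
```

The layer's latency is set by the slowest GPU, so the heuristic minimizes the maximum finish time rather than equalizing the number of experts per GPU. A real system would also have to respect memory limits and communication costs, which this sketch ignores.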

IMPACT Optimizes MoE model inference, potentially reducing latency and improving efficiency for large-scale AI deployments.

GPU
Sourish Wawdhane
Mixture-of-Experts (MoE) models