Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design
Researchers have developed a CPU-GPU hybrid system to improve the performance of Mixture-of-Experts (MoE) models run locally. The system addresses key limitations of local inference, such as slow prefill and poor concurrency, using techniques such as stream-loading prefill and prefill-decode disaggregation. The approach aims to meet cloud-grade service-level objectives (SLOs) for MoE models on consumer hardware, making high-quality inference accessible without datacenter infrastructure.
IMPACT: Enables high-quality, cost-effective local deployment of large MoE models on consumer hardware.