Brief · PulseAugur

RESEARCH · arXiv cs.NE (Neural & Evolutionary) · 1d · [2 sources]

Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design

Researchers have developed a CPU-GPU hybrid system for running Mixture-of-Experts (MoE) models locally. The system addresses key limitations of local inference, such as slow prefill and poor concurrency, with techniques that include stream-loading prefill and prefill-decode disaggregation. The hybrid approach aims to deliver cloud-grade service quality for MoE models on consumer hardware, making high-quality inference accessible without datacenter infrastructure.

IMPACT Enables high-quality, cost-effective local deployment of large MoE models on consumer hardware.

DeepSeek-V3
Mixture-of-Experts
RTX 5090
CPU-GPU hybrid system