Brief · PulseAugur

TOOL · arXiv cs.AI · 23h

ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling

Researchers have developed ZipMoE, a system for serving Mixture-of-Experts (MoE) large language models efficiently on-device. ZipMoE combines lossless compression with cache-affinity scheduling to reduce memory footprint and speed up inference without sacrificing model accuracy. In the reported experiments, it significantly lowers latency and raises throughput on edge devices, shifting the inference bottleneck from I/O to computation.

IMPACT Enables deployment of powerful MoE models on resource-constrained devices, potentially broadening access to AI and the range of applications it can serve.

Mixture-of-Experts
large language models
edge devices
Yuchen Yang
ZipMoE