MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
Researchers have introduced MACS, an inference framework for Mixture-of-Experts Multimodal Large Language Models (MoE MLLMs). MACS targets the straggler effect in expert-parallel inference, where the most heavily loaded expert holds up the rest, using two mechanisms: an Entropy-Weighted Load mechanism that better accounts for the value of visual tokens, and a Dynamic Modality-Adaptive Capacity mechanism that allocates expert resources in real time. The authors report that MACS significantly outperforms existing methods on multimodal benchmarks and present it as a robust option for deploying MoE MLLMs.
IMPACT: Offers a novel approach to deploying MoE MLLMs efficiently, potentially reducing inference cost and latency.
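
The brief names the two mechanisms but gives no formulas, so the following is only a rough sketch of the general idea, not the MACS algorithm. It assumes, purely for illustration, that a text token counts as one unit of load, that a visual token's weight is the normalized Shannon entropy of some per-token distribution, and that a fixed capacity budget is split across experts in proportion to weighted load. The names `token_weight`, `weighted_expert_load`, and `adaptive_capacity` are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return sum(-p * math.log(p) for p in probs if p > 0)


def token_weight(modality, probs):
    """Hypothetical per-token load weight (an assumption, not from the source).

    Text tokens count as 1.0. Visual tokens are weighted by entropy
    normalized to [0, 1], so low-entropy visual tokens add little load.
    """
    if modality == "text":
        return 1.0
    max_entropy = math.log(len(probs)) if len(probs) > 1 else 0.0
    if max_entropy == 0.0:
        return 0.0
    return shannon_entropy(probs) / max_entropy


def weighted_expert_load(tokens, num_experts):
    """Sum token weights per expert instead of counting raw tokens.

    `tokens` is a list of (expert_id, weight) pairs.
    """
    load = [0.0] * num_experts
    for expert_id, weight in tokens:
        load[expert_id] += weight
    return load


def adaptive_capacity(load, total_capacity):
    """Split a fixed capacity budget across experts in proportion to load."""
    total = sum(load)
    if total == 0:
        # No load information: fall back to an even split.
        return [total_capacity // len(load)] * len(load)
    return [int(total_capacity * expert_load / total) for expert_load in load]


if __name__ == "__main__":
    routed = [
        (0, token_weight("text", [0.9, 0.1])),
        (0, token_weight("visual", [0.5, 0.5])),
        (1, token_weight("visual", [0.99, 0.01])),
    ]
    load = weighted_expert_load(routed, num_experts=2)
    print(load, adaptive_capacity(load, total_capacity=8))
```

In this toy setup the expert receiving a low-entropy visual token is charged less load than one receiving a text token, so the capacity split shifts away from it. Whether MACS weights tokens in this direction or with this formula is not stated in the brief.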