Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 10h

MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

Researchers have introduced MODE, a quantization framework designed to reduce the memory cost of Mixture-of-Experts Multimodal Large Language Models (MoE-MLLMs). The framework targets biases in expert importance estimation that degrade the performance of existing quantization methods. By decomposing expert selection frequency by modality and filtering redundant vision tokens, MODE aims to improve quantization accuracy, especially for text-critical experts. Experiments show that MODE achieves substantial compression with minimal performance loss, even at extreme bit-width settings.
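To make the idea of decomposing expert selection frequency by modality concrete, here is a minimal sketch. It is not MODE's implementation: the function name `modality_frequencies`, the routing-trace format, and the toy data are all hypothetical, and the sketch assumes that the estimation bias comes from vision tokens outnumbering text tokens in the calibration data. It only shows how a pooled frequency can understate an expert that text tokens rely on.

```python
from collections import Counter


def modality_frequencies(routing, num_experts):
    """Per-modality selection frequency for each expert.

    routing: iterable of (modality, expert_ids) pairs, one per token, where
    expert_ids lists the experts the router selected for that token.
    This trace format is an assumption made for illustration.
    """
    counts = {}         # modality -> Counter of expert selections
    tokens = Counter()  # modality -> number of tokens seen
    for modality, experts in routing:
        counts.setdefault(modality, Counter()).update(experts)
        tokens[modality] += 1
    # Fraction of tokens of each modality that were routed to each expert.
    return {
        m: [counts[m][e] / tokens[m] for e in range(num_experts)]
        for m in counts
    }


# Toy trace (hypothetical): vision tokens outnumber text tokens 2 to 1.
routing = [
    ("text", [0, 1]),
    ("text", [0, 2]),
    ("vision", [2, 3]),
    ("vision", [2, 3]),
    ("vision", [2, 3]),
    ("vision", [1, 3]),
]

freq = modality_frequencies(routing, num_experts=4)
# Pooled view: ignore modality and treat every token the same.
pooled = modality_frequencies([("all", e) for _, e in routing], 4)["all"]

print("text  :", freq["text"])
print("vision:", freq["vision"])
print("pooled:", pooled)
```

In this toy trace, expert 0 is selected by every text token but by only a third of all tokens, so a pooled frequency ranks it below experts 2 and 3. A per-modality view keeps that text-critical signal visible when deciding which experts to quantize less aggressively.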

IMPACT Reduces memory footprint for MoE-MLLMs, potentially enabling wider deployment and experimentation with these powerful models.

graphics processing unit
MoE-MLLMs
MoE-LLMs
integer linear programming
Mixture-of-Experts Multimodal Large Language Models
PTQ methods
W3A16