Brief · PulseAugur

TOOL · arXiv cs.AI · 10h

DOT-MoE: Differentiable Optimal Transport for MoEfication

Researchers have introduced DOT-MoE, a framework that converts dense large language models into sparse Mixture-of-Experts (MoE) architectures. The method frames the decomposition of dense layers as a differentiable optimal transport problem: differentiable Sinkhorn-Knopp iterations enforce expert capacity, and straight-through estimators allow neuron-to-expert assignments and token routing to be learned end to end. In the reported experiments, DOT-MoE outperforms existing methods, retaining 90% of dense-model performance while halving the number of active parameters.
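
To make the capacity idea concrete, here is a minimal pure-Python sketch of Sinkhorn-Knopp normalization applied to a neuron-to-expert score matrix. It is an illustration under stated assumptions, not the DOT-MoE implementation: the function names `sinkhorn` and `hard_assign`, the `temperature` parameter, the equal-capacity constraint, and the toy scores are all hypothetical. The straight-through estimator is omitted because it only matters inside an autodiff framework, where the hard assignment would be used in the forward pass and the soft plan would supply gradients.

```python
import math


def sinkhorn(scores, n_iters=50, temperature=1.0):
    """Return a soft neuron-to-expert plan (hypothetical helper).

    Rows (neurons) sum to 1; columns (experts) approach an equal
    capacity of n_neurons / n_experts as the iterations converge.
    """
    n, m = len(scores), len(scores[0])
    capacity = n / m
    # Exponentiate scores; subtracting the row max keeps exp() stable
    # and does not change the result after row normalization.
    plan = []
    for row in scores:
        top = max(row)
        plan.append([math.exp((s - top) / temperature) for s in row])
    for _ in range(n_iters):
        # Column step: scale each expert's total mass to its capacity.
        for j in range(m):
            col = sum(plan[i][j] for i in range(n))
            for i in range(n):
                plan[i][j] *= capacity / col
        # Row step: each neuron distributes exactly one unit of mass.
        for i in range(n):
            total = sum(plan[i])
            plan[i] = [p / total for p in plan[i]]
    return plan


def hard_assign(plan):
    """Pick the highest-mass expert per neuron (forward-pass view)."""
    return [max(range(len(row)), key=row.__getitem__) for row in plan]


# Toy example: 4 neurons, 2 experts, so each expert has capacity 2.
toy_scores = [[2.0, 0.0], [1.5, 0.5], [1.0, 0.2], [0.1, 0.3]]
toy_plan = sinkhorn(toy_scores)
toy_experts = hard_assign(toy_plan)
```

Without the column step, three of the four toy neurons would pile onto expert 0; the alternating normalization pushes mass toward the under-used expert until both carry the same total.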

IMPACT Enables more efficient inference for large language models by converting dense architectures to sparse MoEs.

Mixture of Experts
Large Language Models
DOT-MoE