XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash
Xiaomi has released MiMo-V2.5-Pro-FP4-DFlash, a new model optimized for efficient inference. It features expert-only FP4 quantization to reduce memory footprint and bandwidth pressure while maintaining quality. The model also incorporates a BF16 DFlash drafter for speculative decoding, enabling faster token generation by proposing blocks of tokens per forward pass.
AI IMPACT Enables more efficient deployment of large language models, potentially reducing inference costs and increasing accessibility.
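
To illustrate how block-wise speculative decoding can cut the number of target-model passes, here is a minimal toy sketch of the generic greedy draft-and-verify loop. It is not the DFlash algorithm or any MiMo API: `target_next`, `draft_block`, the vocabulary size, and the injected drafter errors are all hypothetical stand-ins, and the "models" are plain arithmetic functions. The sketch assumes greedy decoding, where a drafted token is accepted only if it matches what the target model would have produced.

```python
from typing import List, Tuple

VOCAB = 50  # hypothetical toy vocabulary size


def target_next(ctx: List[int]) -> int:
    """Hypothetical stand-in for the large target model (greedy next token)."""
    return (sum(ctx) * 31 + len(ctx)) % VOCAB


def draft_block(ctx: List[int], k: int) -> List[int]:
    """Hypothetical drafter: proposes k tokens in one call, sometimes wrongly."""
    out, cur = [], list(ctx)
    for _ in range(k):
        tok = target_next(cur)
        if len(cur) % 5 == 0:  # deliberately inject a drafting error
            tok = (tok + 1) % VOCAB
        out.append(tok)
        cur.append(tok)
    return out


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target-model pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq


def speculative_decode(prompt: List[int], n_new: int, k: int) -> Tuple[List[int], int]:
    """Draft k tokens, verify them against the target, keep the matching prefix."""
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < n_new:
        block = draft_block(seq, k)
        # A real system scores all k drafted positions in one batched target
        # pass; here that single pass is simulated position by position.
        target_passes += 1
        accepted: List[int] = []
        for tok in block:
            expected = target_next(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(tok)
        seq.extend(accepted)
    return seq[: len(prompt) + n_new], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    spec, passes = speculative_decode(prompt, n_new=20, k=4)
    print(spec == greedy_decode(prompt, 20), passes)
```

Because every kept token is either a verified match or the target's own correction, the output is identical to plain greedy decoding; the saving comes only from needing fewer target passes when the drafter is often right.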