We added W8A8 activation quantization to MLX — prefill went from 2.84s to 2.52s on M5 Pro
Mininglamp AI has developed Cider, an SDK that adds W8A8 activation quantization (8-bit weights and 8-bit activations) to the MLX framework. The optimization speeds up prefill for large vision-language models on Apple Silicon, cutting prefill time from 2.84s to 2.52s on an M5 Pro chip, a reduction of about 11%. Cider uses custom Metal kernels and offers performance improvements for models running through MLX, though INT8 TensorOps are limited to M5 processors and above.
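To make "W8A8" concrete, here is a minimal pure-Python sketch of the general idea: both the weights and the activations are mapped to 8-bit integer codes, the products are accumulated as integers, and the result is rescaled to floating point at the end. This is an illustration only and is not Cider's code or API. The function names `quantize_int8` and `w8a8_dot`, the symmetric per-tensor scaling scheme, and the sample values are all assumptions made for the example; the source does not describe which scheme Cider's Metal kernels use.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization (an assumed scheme, for illustration).

    Returns (codes, scale) where each code is an integer in [-127, 127]
    and value is approximately code * scale.
    """
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def w8a8_dot(activations, weights):
    """Dot product with both operands quantized to 8-bit codes."""
    a_codes, a_scale = quantize_int8(activations)
    w_codes, w_scale = quantize_int8(weights)
    # Integer multiply-accumulate: the step that INT8 hardware accelerates.
    acc = sum(a * w for a, w in zip(a_codes, w_codes))
    # One floating-point rescale at the end recovers the real-valued result.
    return acc * a_scale * w_scale


# Hypothetical sample data, not taken from any model.
activations = [0.5, -1.25, 2.0, 0.75]
weights = [0.1, 0.4, -0.3, 0.2]

exact = sum(a * w for a, w in zip(activations, weights))
approx = w8a8_dot(activations, weights)
print(f"float result: {exact:.4f}, W8A8 result: {approx:.4f}")
```

The quantized result differs slightly from the floating-point one; that small accuracy cost is what is traded for cheaper integer arithmetic during prefill, where large matrix multiplications dominate.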
IMPACT Speeds up prefill for models running through MLX on Apple Silicon, potentially accelerating local AI development and deployment.