Multi-Head Latent Attention (MLA) is an attention mechanism that compresses the key-value (KV) cache in large language models. By projecting keys and values into a low-dimensional latent space and caching that latent representation, MLA achieves a 5-10x reduction in cache size with minimal impact on model quality. This compression lets models such as DeepSeek-V2 and Kimi K2.x handle longer context windows and larger batch sizes during inference.
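To make the cache arithmetic concrete, here is a minimal sketch that compares the per-sequence cache size of standard multi-head attention with that of a single cached latent vector per token. Every dimension below (`N_HEADS`, `HEAD_DIM`, `LATENT_DIM`, `N_LAYERS`, `SEQ_LEN`) is a hypothetical value chosen for illustration and is not the configuration of DeepSeek-V2, Kimi, or any other real model; the helper names are likewise made up for this example.

```python
def mha_elems_per_token(n_heads: int, head_dim: int) -> int:
    """Standard multi-head attention caches one key and one value vector per head."""
    return 2 * n_heads * head_dim


def latent_elems_per_token(latent_dim: int) -> int:
    """Latent caching stores one compressed vector per token, shared by all heads."""
    return latent_dim


def cache_bytes(seq_len: int, n_layers: int, elems_per_token: int,
                bytes_per_elem: int = 2) -> int:
    """Cache size for one sequence; 2 bytes per element assumes fp16/bf16 storage."""
    return seq_len * n_layers * elems_per_token * bytes_per_elem


# Hypothetical configuration, for illustration only.
N_HEADS, HEAD_DIM, LATENT_DIM = 8, 64, 128
N_LAYERS, SEQ_LEN = 32, 4096

full = cache_bytes(SEQ_LEN, N_LAYERS, mha_elems_per_token(N_HEADS, HEAD_DIM))
latent = cache_bytes(SEQ_LEN, N_LAYERS, latent_elems_per_token(LATENT_DIM))
ratio = full / latent

MIB = 2 ** 20
print(f"full KV cache:  {full / MIB:.0f} MiB")    # 256 MiB
print(f"latent cache:   {latent / MIB:.0f} MiB")  # 32 MiB
print(f"reduction:      {ratio:.1f}x")            # 8.0x
```

With these made-up numbers the reduction is 8x, which falls inside the 5-10x range quoted above. The real ratio depends on the model's head count, head dimension, and latent width, and on which baseline attention variant it is compared against.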
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Enables significantly longer context windows and larger batch sizes for LLMs by reducing memory requirements.
RANK_REASON Technical paper detailing a novel mechanism for LLM KV cache compression.