Brief · PulseAugur

RESEARCH · dev.to (LLM tag) · 2d · [3 sources]

Multi-Head Latent Attention (MLA)

Multi-Head Latent Attention (MLA) is an attention mechanism designed to compress the KV cache in large language models. Instead of caching full per-head keys and values, MLA projects them into a low-dimensional latent space and caches the latent vector, which lets models such as DeepSeek-V2/V3 and Kimi K2.x handle longer contexts and larger batch sizes with less memory. Because the cached object is the latent rather than the per-head keys and values, prefix caching and the attention computation are implemented differently, giving a more efficient trade-off between memory usage and computational cost during inference.
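To make the latent projection concrete, here is a minimal NumPy sketch, not the implementation used by any of the models named above. All sizes and the matrix names (`W_down`, `W_up_k`, `W_up_v`, `W_q`) are hypothetical, the weights are random, and the sketch leaves out rotary position handling and causal masking. It only illustrates the core idea: cache one small latent per token, then expand it to keys and values when attention is computed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not taken from any real model).
d_model, n_heads, d_head, d_latent = 64, 4, 16, 8

# Hypothetical random projection matrices standing in for learned weights.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_q = rng.standard_normal((d_model, n_heads * d_head)) / np.sqrt(d_model)


def compress(hidden):
    """Project hidden states (T, d_model) to the latent that is cached (T, d_latent)."""
    return hidden @ W_down


def attend(h_query, latent_cache):
    """Attend one query token over the latent cache, expanding K and V on the fly."""
    T = latent_cache.shape[0]
    k = (latent_cache @ W_up_k).reshape(T, n_heads, d_head)
    v = (latent_cache @ W_up_v).reshape(T, n_heads, d_head)
    q = (h_query @ W_q).reshape(n_heads, d_head)

    # Scaled dot-product scores per head, then a numerically stable softmax.
    scores = np.einsum("hd,thd->ht", q, k) / np.sqrt(d_head)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)

    out = np.einsum("ht,thd->hd", weights, v)
    return out.reshape(n_heads * d_head)


hidden = rng.standard_normal((10, d_model))   # 10 tokens of context
latent_cache = compress(hidden)               # what gets stored per token
output = attend(hidden[-1], latent_cache)     # last token attends over the cache

full_kv_elems = 2 * hidden.shape[0] * n_heads * d_head  # separate K and V per head
latent_elems = latent_cache.size
print(latent_cache.shape, output.shape, full_kv_elems, latent_elems)
```

With these toy sizes the cache holds 80 values where separate per-head keys and values would hold 1,280. The expansion back to K and V is the extra computation that is traded for the smaller cache.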

IMPACT Enables LLMs to process longer contexts and larger batches by drastically reducing memory requirements for the KV cache.
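A rough back-of-the-envelope calculation shows the scale of the saving. The dimensions below are assumptions chosen to be roughly DeepSeek-V2-sized; they are not stated in this brief and should be checked against the model's published configuration. The helper `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(elems_per_token_per_layer, n_layers, seq_len, batch, bytes_per_elem=2):
    """Total cache size in bytes; bytes_per_elem=2 assumes a 16-bit cache."""
    return elems_per_token_per_layer * n_layers * seq_len * batch * bytes_per_elem


# Assumed dimensions (illustrative, not confirmed by the brief).
n_layers, n_heads, d_head = 60, 128, 128
d_latent, d_rope = 512, 64  # compressed latent plus a small positional key part

mha_elems = 2 * n_heads * d_head  # full keys and values for every head
mla_elems = d_latent + d_rope     # one latent vector plus the positional part

seq_len, batch = 32768, 1
mha_bytes = kv_cache_bytes(mha_elems, n_layers, seq_len, batch)
mla_bytes = kv_cache_bytes(mla_elems, n_layers, seq_len, batch)

print(f"standard multi-head cache: {mha_bytes / 2**30:.1f} GiB")
print(f"latent cache:              {mla_bytes / 2**30:.2f} GiB")
print(f"reduction:                 {mha_bytes / mla_bytes:.1f}x")
```

Under these assumptions a single 32K-token sequence needs about 120 GiB of cache with full per-head keys and values and about 2.1 GiB with the latent cache, a reduction of roughly 57x.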

DeepSeek-V3
KV cache
DeepSeek-V2
Rotary Position Embedding
Grouped-Query Attention
Kimi K2.x
Multi-Head Latent Attention