Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention
Sebastian Raschka's analysis highlights recent architectural innovations in open-weight LLMs aimed at improving long-context efficiency. Key developments include KV sharing and per-layer embeddings in Google's Gemma 4 models, layer-wise attention budgeting in Laguna XS.2, and compressed convolutional attention in ZAYA1-8B. DeepSeek V4 also incorporates mHC and compressed attention, addressing the growing constraints of KV cache size and memory traffic as models handle longer contexts for reasoning and agent workflows. AI
IMPACT New architectural techniques in open-weight LLMs are improving efficiency for long contexts, potentially enabling more complex reasoning and agent capabilities.