KV cache
PulseAugur coverage of KV cache — every cluster mentioning KV cache across labs, papers, and developer communities, ranked by signal.
Sentiment data available for 9 days
-
Video Generation with Predictive Latents
Researchers have developed several new methods to improve the efficiency and quality of visual generative models. DC-DiT introduces dynamic chunking to Diffusion Transformers, adaptively compressing visual data for fast…
-
Google's Gemma 4 models achieve 3x speed boost with speculative decoding
Google has released Multi-Token Prediction (MTP) drafters for its Gemma 4 open models, which can increase inference speed by up to three times. This advancement utilizes a speculative decoding architecture, allowing a l…
-
FluxMoE system decouples expert weights for faster LLM serving
Researchers have developed FluxMoE, a new system designed to improve the efficiency of serving Mixture-of-Experts (MoE) models. FluxMoE addresses the challenge of large parameter sizes in MoE models by decoupling expert…
-
New theory unifies KV cache eviction for LLMs, improving long-context generation
Researchers have developed a new method for managing KV cache eviction in large language models, drawing inspiration from the Information Bottleneck principle. This approach, named CapKV, aims to preserve the most predi…
-
Kwai Summary Attention compresses historical contexts for efficient long-context LLMs
Researchers have introduced Kwai Summary Attention (KSA), a novel attention mechanism designed to address the quadratic time complexity of standard softmax attention in large language models. KSA aims to maintain a line…
-
New research explores LLM security, efficiency, and training optimization
Researchers are developing novel methods to enhance the efficiency and security of Large Language Models (LLMs). One approach, "Widening the Gap," exploits outlier injection to compromise LLM quantization, demonstrating…
-
New architectures and frameworks target LLM serving bottlenecks for long contexts
Researchers have developed novel architectures and techniques to address the escalating latency and energy consumption challenges in serving large language models (LLMs) with long contexts. One approach, AMMA, proposes …
-
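New research tackles LLM KV cache bottlenecks with advanced compression and storage techniques
Several research papers published in May 2026 introduce new techniques for optimizing the key-value (KV) cache of large language models to address memory and latency bottlenecks. These methods include offloading the KV cache to object storage such as S3 (ObjectCache), advanced compression strategies such as three-way token routing (VECTOR), and selective KV cache recomputation using an auxiliary model (CacheClip). Other methods focus on hardware-aware quantization (InnerQ, OCTOPUS) and serving-aware adaptive compression (KVServe) to improve efficiency and reduce decoding latency, especially for long-context inference and retrieval…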
-
LLM inference speed-ups explained with KV cache coding tutorials
The KV cache is a crucial technique for optimizing the inference speed of Large Language Models (LLMs) in production environments. It works by storing and reusing intermediate key and value computations, thereby avoidin…
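The storing-and-reusing idea in this summary can be illustrated with a minimal sketch. It is not taken from the linked tutorials: it assumes a single attention head, one layer, and tiny hand-picked weight matrices, and every name in it (`decode_without_cache`, `decode_with_cache`, `W_Q`, `W_K`, `W_V`, `TOKENS`) is hypothetical. It compares recomputing keys and values for the whole prefix at every decode step against appending one new key/value pair to a cache.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(weights, x):
    # weights is a list of rows; returns weights @ x
    return [dot(row, x) for row in weights]

def attend(q, keys, values):
    # Scaled dot-product attention of one query over all cached positions.
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    dim = len(values[0])
    return [sum(p * v[i] for p, v in zip(probs, values)) for i in range(dim)]

def decode_without_cache(tokens, wq, wk, wv):
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [matvec(wk, x) for x in prefix]
        values = [matvec(wv, x) for x in prefix]
        projections += len(prefix)  # one K/V projection pair per prefix token
        outputs.append(attend(matvec(wq, tokens[t]), keys, values))
    return outputs, projections

def decode_with_cache(tokens, wq, wk, wv):
    """Project only the newest token and reuse earlier keys and values."""
    cache_k, cache_v = [], []
    outputs, projections = [], 0
    for x in tokens:
        cache_k.append(matvec(wk, x))
        cache_v.append(matvec(wv, x))
        projections += 1  # a single new K/V projection pair per step
        outputs.append(attend(matvec(wq, x), cache_k, cache_v))
    return outputs, projections

# Hypothetical toy weights and token embeddings (assumptions for illustration).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.3]]
W_V = [[1.0, 2.0], [0.0, 1.0]]
TOKENS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
```

Calling both functions on `TOKENS` yields the same attention outputs, while the number of key/value projections drops from 10 (1 + 2 + 3 + 4) to 4. In this sketch the saving grows quadratically with sequence length, at the cost of memory that grows linearly with it, which is the bottleneck several other items in this listing aim to reduce.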
-
Transformer consciousness: Speculative notes explore AI experience and attention mechanics
A speculative essay explores the potential for consciousness within Transformer models, suggesting that the experience of generating text (decode) is identical to the process of feeding text in (prefill). This perspecti…