KV cache
PulseAugur coverage of KV cache — every cluster mentioning KV cache across labs, papers, and developer communities, ranked by signal.
1 day of sentiment data
-
FibQuant method offers significant KV-cache compression for LLMs
Researchers have developed FibQuant, a novel vector quantization method designed to significantly compress the key-value (KV) cache used in large language models. This technique aims to reduce the memory traffic associa…
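The core idea of vector-quantizing a KV cache can be sketched generically: replace each cached vector with the index of its nearest codebook entry. This is a minimal illustrative sketch, not FibQuant itself — the codebook here is random (a real one would be learned), and all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generic vector-quantization sketch (NOT FibQuant's actual method):
# store a 1-byte codebook index per cached key instead of d fp16 values.
d, n_codes, n_keys = 16, 32, 1000
codebook = rng.normal(size=(n_codes, d))   # stand-in for a trained codebook
keys = rng.normal(size=(n_keys, d))        # cached key vectors to compress

# Nearest-codeword assignment by squared Euclidean distance.
dists = ((keys[:, None, :] - codebook[None]) ** 2).sum(axis=-1)  # (n_keys, n_codes)
codes = dists.argmin(axis=1).astype(np.uint8)  # 32 codes fit in one byte
keys_hat = codebook[codes]                     # dequantized approximation

orig_bytes = keys.size * 2   # fp16 baseline: 2 bytes per element
comp_bytes = codes.size      # 1 byte per vector
print(f"compression: {orig_bytes / comp_bytes:.0f}x")
```

With 16-dim fp16 vectors and a 32-entry codebook this gives a 32x size reduction; the trade-off is the reconstruction error `keys - keys_hat`, which a trained codebook would minimize.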
-
Google's TurboQuant cuts LLM memory use by 6x with no accuracy loss
Google researchers have developed a new technique called TurboQuant that significantly reduces the memory required by large language models. By employing a two-step process involving data rotation and scalar quantizatio…
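The rotate-then-quantize pattern mentioned here can be illustrated generically: a random orthogonal rotation spreads outlier energy across dimensions, after which simple scalar quantization loses less information. This sketch is not Google's implementation — the rotation (QR of a Gaussian matrix), the int8 scheme, and the injected outlier are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Random orthogonal rotation: a generic stand-in for the rotation step.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

def quantize_int8(v):
    """Symmetric per-vector scalar quantization to int8."""
    scale = np.abs(v).max() / 127
    return np.round(v / scale).astype(np.int8), scale

x = rng.normal(size=d)
x[3] = 20.0                    # outlier that would dominate a direct scale

rot = Q @ x                    # rotation spreads the outlier's energy
q, s = quantize_int8(rot)
x_hat = Q.T @ (q.astype(np.float64) * s)   # dequantize, rotate back

err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative reconstruction error: {err:.4f}")
```

Because `Q` is orthogonal, rotating and rotating back is lossless; the only error comes from the rounding step, which the rotation keeps small even in the presence of outliers.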
-
PyTorch struggles to match TensorFlow accuracy; quantization challenges persist
A researcher found that reproducing a paper's results on the DermMNIST dataset using PyTorch yielded a 4% lower accuracy compared to the original TensorFlow implementation. This discrepancy is attributed to potential di…
-
GPU hardware analysis reveals memory bandwidth, not FLOPS, is key for LLMs
This article explains the fundamental architecture of GPUs, focusing on how their design prioritizes memory bandwidth over raw computational power for machine learning tasks. It details how GPUs manage thousands of thre…
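The bandwidth-vs-FLOPS point can be made with back-of-envelope arithmetic for single-token decoding: every weight must be read from memory once per token, while only about two FLOPs are done per weight. All hardware numbers below are illustrative round figures, not any specific GPU's spec sheet.

```python
# Roofline-style estimate for one decode step of a 7B-parameter fp16 model.
params = 7e9
bytes_per_param = 2            # fp16 weights
flops_per_token = 2 * params   # one multiply-add per parameter
bytes_per_token = params * bytes_per_param

peak_flops = 300e12            # ~300 TFLOP/s fp16 (illustrative GPU)
peak_bw = 1.0e12               # ~1 TB/s HBM bandwidth (illustrative)

t_compute = flops_per_token / peak_flops   # time if compute-bound
t_memory = bytes_per_token / peak_bw       # time if bandwidth-bound

print(f"compute-bound: {t_compute * 1e3:.3f} ms/token")
print(f"memory-bound:  {t_memory * 1e3:.3f} ms/token")
```

Under these assumptions the memory-bound time is roughly two orders of magnitude larger than the compute-bound time, which is why decoding throughput tracks memory bandwidth rather than peak FLOPS.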
-
AdapShot optimizes LLM in-context learning with dynamic shot counts and KV cache reuse
Researchers have introduced AdapShot, a novel approach to enhance many-shot in-context learning for large language models. This method dynamically adjusts the number of examples provided based on query difficulty, using…
-
New HERMES and DSCache methods improve streaming video understanding with KV cache
Researchers have developed new methods to improve the efficiency of multimodal large language models (MLLMs) for understanding streaming video. One approach, HERMES, conceptualizes the KV cache as a hierarchical memory …
-
Video Generation with Predictive Latents
Researchers have developed several new methods to improve the efficiency and quality of visual generative models. DC-DiT introduces dynamic chunking to Diffusion Transformers, adaptively compressing visual data for fast…
-
Google's Gemma 4 models achieve 3x speed boost with speculative decoding
Google has released Multi-Token Prediction (MTP) drafters for its Gemma 4 open models, which can increase inference speed by up to three times. This advancement utilizes a speculative decoding architecture, allowing a l…
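The draft-then-verify control flow behind speculative decoding can be shown with toy deterministic "models" standing in for the drafter and the target; this is not Gemma or MTP, just the accept-longest-verified-prefix loop. In a real system the per-token verifications below are batched into a single target-model forward pass.

```python
def target_next(ctx):
    """Toy deterministic stand-in for the large target model."""
    return (sum(ctx) + 1) % 7

def draft_next(ctx):
    """Toy drafter that sometimes disagrees with the target."""
    return (sum(ctx) + 1) % 7 if len(ctx) % 3 else 0

def speculate(ctx, n_draft=4):
    """Draft n tokens cheaply, then keep the longest target-verified prefix."""
    draft = []
    for _ in range(n_draft):
        draft.append(draft_next(ctx + draft))
    accepted = []
    for tok in draft:
        if tok == target_next(ctx + accepted):
            accepted.append(tok)          # draft token verified: keep it
        else:
            accepted.append(target_next(ctx + accepted))  # correct and stop
            break
    return accepted

out = speculate([1, 2])
print(out)   # more than one token emitted per verification round
```

The speed-up comes from emitting several accepted tokens per round while guaranteeing the output matches what the target model alone would have produced.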
-
FluxMoE system decouples expert weights for faster LLM serving
Researchers have developed FluxMoE, a new system designed to improve the efficiency of serving Mixture-of-Experts (MoE) models. FluxMoE addresses the challenge of large parameter sizes in MoE models by decoupling expert…
-
New theory unifies KV cache eviction for LLMs, improving long-context generation
Researchers have developed a new method for managing KV cache eviction in large language models, drawing inspiration from the Information Bottleneck principle. This approach, named CapKV, aims to preserve the most predi…
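Score-based KV eviction in general can be sketched in a few lines: when the cache exceeds a budget, keep the entries judged most useful and drop the rest. The importance score below (cumulative attention mass, here random) is a generic placeholder, not CapKV's Information Bottleneck criterion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generic score-based KV eviction sketch (NOT CapKV's actual criterion).
budget = 4                              # max tokens to keep in cache
keys = rng.normal(size=(10, 8))         # 10 cached key vectors
values = rng.normal(size=(10, 8))
attn_mass = rng.random(10)              # placeholder importance scores

# Keep the `budget` highest-scoring tokens, preserving temporal order.
keep = np.sort(np.argsort(attn_mass)[-budget:])
keys, values = keys[keep], values[keep]
print(keys.shape)
```

Eviction methods differ mainly in how that score is computed; the trimming step itself is the same cheap index-select.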
-
Kwai Summary Attention compresses historical contexts for efficient long-context LLMs
Researchers have introduced Kwai Summary Attention (KSA), a novel attention mechanism designed to address the quadratic time complexity of standard softmax attention in large language models. KSA aims to maintain a line…
-
New research explores efficient LLM inference through sparse caching, batching, and secure computation
Multiple research papers are exploring novel techniques to enhance the efficiency and performance of Large Language Model (LLM) inference and training. These advancements include queueing-theoretic frameworks for stabil…
-
New architectures and frameworks target LLM serving bottlenecks for long contexts
Researchers have developed novel architectures and techniques to address the escalating latency and energy consumption challenges in serving large language models (LLMs) with long contexts. One approach, AMMA, proposes …
-
LLM inference speed-ups explained with KV cache coding tutorials
The KV cache is a crucial technique for optimizing the inference speed of Large Language Models (LLMs) in production environments. It works by storing and reusing intermediate key and value computations, thereby avoidin…
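The store-and-reuse mechanism described here fits in a few lines of NumPy: at each decode step, key/value projections are computed only for the new token and appended to the cache, while attention runs over all cached entries. Dimensions and weights below are toy values (single head, no batching).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # toy head dimension
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []                # grows by one row per token

def decode_step(x):
    """Attend the new token's query over all cached keys/values."""
    k_cache.append(x @ W_k)              # K/V computed once, then reused
    v_cache.append(x @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    q = x @ W_q
    scores = K @ q / np.sqrt(d)          # (t,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                   # (d,) attention output

for _ in range(5):                       # simulate 5 decode steps
    out = decode_step(rng.normal(size=d))

print(len(k_cache))                      # 5 cached keys, none recomputed
```

Without the cache, every step would recompute K and V for the entire prefix, turning each token's cost from O(t) into O(t²)-per-sequence work; the trade-off is cache memory that grows linearly with context length.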
-
Transformer consciousness: Speculative notes explore AI experience and attention mechanics
A speculative essay explores the potential for consciousness within Transformer models, suggesting that the experience of generating text (decode) is identical to the process of feeding text in (prefill). This perspecti…