KV cache
PulseAugur coverage of KV cache — every cluster mentioning KV cache across labs, papers, and developer communities, ranked by signal.
9 days with sentiment data
-
New MLA attention mechanism slashes LLM KV cache by up to 10x
Multi-Head Latent Attention (MLA) is a novel attention mechanism designed to significantly compress the KV cache in large language models. By projecting KV pairs into a low-dimensional latent space, MLA achieves substan…
-
Fixing local LLM OOM errors by optimizing KV cache and quantization
Running large open-source language models locally can lead to out-of-memory errors, even if the model's weights seem to fit within the available VRAM. This is primarily due to the significant memory required for the KV …
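As a rough illustration of why the cache can exhaust VRAM even when the weights fit, the sketch below estimates KV cache size for a standard multi-head or grouped-query attention model. The function name and the example configuration (32 layers, 8 KV heads, head dimension 128, fp16) are hypothetical and not taken from the article; real runtimes add their own overhead on top.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    Assumes one key tensor and one value tensor per layer, each of shape
    (batch_size, num_kv_heads, seq_len, head_dim). bytes_per_elem=2
    corresponds to fp16/bf16 storage.
    """
    tensors_per_layer = 2  # keys + values
    return (tensors_per_layer * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model configuration, chosen only for illustration.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=32_768)
print(f"KV cache at 32k tokens: {size / 2**30:.1f} GiB")
```

Because the estimate grows linearly with sequence length and batch size, halving `bytes_per_elem` (for example, by quantizing the cache to 8 bits) or reducing the context length are the most direct levers under these assumptions.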
-
TurboQuant uses PolarQuant to compress LLM KV cache by 4.2x
A technical deep dive explains the inner workings of TurboQuant, a novel method for compressing large language model KV caches. TurboQuant utilizes a technique called PolarQuant, which transforms KV embeddings into pola…
-
TurboQuant paper tackles LLM KV cache problem
A recent paper introduces TurboQuant, a novel method for optimizing the KV cache in large language models. This technique aims to significantly reduce memory usage and improve inference speed. The research explores the …
-
LLM benchmarks mislead on inference speed for long contexts
Current LLM inference benchmarks are misleading because they primarily measure short-context performance, which does not reflect real-world usage involving longer contexts. This discrepancy arises from the differing com…
-
KV Cache Optimization Solves LLM GPU Memory Bottleneck
Large language models (LLMs) face a significant bottleneck in serving efficiency due to the memory demands of KV cache, which stores intermediate attention calculations. This KV cache, essential for enabling faster resp…
-
KV cache eviction protection proves more vital than scoring
Researchers have developed a new method for managing KV cache eviction in large language models, finding that structural protection is more critical than scoring algorithms. Their study on transformer models revealed th…
-
DeepSeek V4 launches with 1.6T MoE, 1M context, and lower costs
DeepSeek V4, an open-weight model family, has been released with a 1.6-trillion-parameter Mixture-of-Experts architecture that activates only 49 billion parameters per token. This new model boasts a 1-million-token cont…
-
GitHub cuts agent workflow costs tenfold with KV cache optimization
GitHub has developed a method to significantly reduce the cost of agentic workflows by optimizing the KV cache. This approach involves trading VRAM for compute, allowing for a tenfold reduction in expenses. The techniqu…
-
New methods enhance AI attention efficiency for video and LLMs
Researchers have developed several new methods to improve the efficiency of attention mechanisms in AI models. One approach, SimInsert, focuses on seamless video object insertion by decoupling single-frame editing from …
-
AI Inference Systems Optimize for Real-Time with Speculative Decoding
This article delves into the technical aspects of optimizing AI inference for real-time applications. It highlights the growing importance of minimizing latency as a core architectural consideration. The piece further e…
-
New KV-cache compression method alpha outperforms existing techniques
Researchers have developed a new KV-cache compression method called alpha, which uses a diversity-penalty survivor approach. This method was found to outperform seven other mechanisms in a design-space study on mathemat…
-
Pyramid Forcing improves long video generation with head-aware cache policy
Researchers have introduced Pyramid Forcing, a novel KV cache policy designed to enhance the quality of long video generation. This method addresses the issue of accumulated errors in autoregressive video synthesis by r…
-
FibQuant method offers significant KV-cache compression for LLMs
Researchers have developed FibQuant, a novel vector quantization method designed to significantly compress the key-value (KV) cache used in large language models. This technique aims to reduce the memory traffic associa…
-
Google's TurboQuant cuts LLM memory use by 6x with no accuracy loss
Google researchers have developed a new technique called TurboQuant that significantly reduces the memory required by large language models. By employing a two-step process involving data rotation and scalar quantizatio…
-
PyTorch struggles to match TensorFlow accuracy; quantization challenges persist
A researcher found that reproducing a paper's results on the DermMNIST dataset using PyTorch yielded a 4% lower accuracy compared to the original TensorFlow implementation. This discrepancy is attributed to potential di…
-
GPU hardware analysis reveals memory bandwidth, not FLOPS, is key for LLMs
This article explains the fundamental architecture of GPUs, focusing on how their design prioritizes memory bandwidth over raw computational power for machine learning tasks. It details how GPUs manage thousands of thre…
-
AdapShot optimizes LLM in-context learning with dynamic shot counts and KV cache reuse
Researchers have introduced AdapShot, a novel approach to enhance many-shot in-context learning for large language models. This method dynamically adjusts the number of examples provided based on query difficulty, using…
-
New HERMES and DSCache methods improve streaming video understanding with KV cache
Researchers have developed new methods to improve the efficiency of multimodal large language models (MLLMs) for understanding streaming video. One approach, HERMES, conceptualizes the KV cache as a hierarchical memory …
-
HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
Researchers are developing several novel methods to optimize the Key-Value (KV) cache in large language models, which is a major bottleneck for long-context processing. These approaches include training models to inhere…