KV cache
PulseAugur coverage of KV cache — every cluster mentioning KV cache across labs, papers, and developer communities, ranked by signal.
1 day of sentiment data
-
FibQuant method offers significant KV-cache compression for LLMs
Researchers have developed FibQuant, a novel vector quantization method designed to significantly compress the key-value (KV) cache used in large language models. This technique aims to reduce the memory traffic associa…
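The core idea of vector-quantizing a KV cache can be sketched generically: replace each cached vector with the index of its nearest codebook entry. This is a minimal illustrative sketch, not FibQuant itself — the codebook here is random (a real one would be learned), and all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generic vector-quantization sketch (NOT FibQuant's actual method):
# store a 1-byte codebook index per cached key instead of d fp16 values.
d, n_codes, n_keys = 16, 32, 1000
codebook = rng.normal(size=(n_codes, d))   # stand-in for a trained codebook
keys = rng.normal(size=(n_keys, d))        # cached key vectors to compress

# Nearest-codeword assignment by squared Euclidean distance.
dists = ((keys[:, None, :] - codebook[None]) ** 2).sum(axis=-1)  # (n_keys, n_codes)
codes = dists.argmin(axis=1).astype(np.uint8)  # 32 codes fit in one byte
keys_hat = codebook[codes]                     # dequantized approximation

orig_bytes = keys.size * 2   # fp16 baseline: 2 bytes per element
comp_bytes = codes.size      # 1 byte per vector
print(f"compression: {orig_bytes / comp_bytes:.0f}x")
```

With 16-dim fp16 vectors and a 32-entry codebook this gives a 32x size reduction; the trade-off is the reconstruction error `keys - keys_hat`, which a trained codebook would minimize.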
-
Google's TurboQuant cuts LLM memory use by 6x with no accuracy loss
Google researchers have developed a new technique called TurboQuant that significantly reduces the memory required by large language models. By employing a two-step process involving data rotation and scalar quantizatio…
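The rotate-then-quantize pattern mentioned here can be illustrated generically: a random orthogonal rotation spreads outlier energy across dimensions, after which simple scalar quantization loses less information. This sketch is not Google's implementation — the rotation (QR of a Gaussian matrix), the int8 scheme, and the injected outlier are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Random orthogonal rotation: a generic stand-in for the rotation step.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

def quantize_int8(v):
    """Symmetric per-vector scalar quantization to int8."""
    scale = np.abs(v).max() / 127
    return np.round(v / scale).astype(np.int8), scale

x = rng.normal(size=d)
x[3] = 20.0                    # outlier that would dominate a direct scale

rot = Q @ x                    # rotation spreads the outlier's energy
q, s = quantize_int8(rot)
x_hat = Q.T @ (q.astype(np.float64) * s)   # dequantize, rotate back

err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative reconstruction error: {err:.4f}")
```

Because `Q` is orthogonal, rotating and rotating back is lossless; the only error comes from the rounding step, which the rotation keeps small even in the presence of outliers.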
-
PyTorch struggles to match TensorFlow accuracy; quantization challenges persist
A researcher found that reproducing a paper's results on the DermMNIST dataset using PyTorch yielded a 4% lower accuracy compared to the original TensorFlow implementation. This discrepancy is attributed to potential di…
-
GPU hardware analysis reveals memory bandwidth, not FLOPS, is key for LLMs
This article explains the fundamental architecture of GPUs, focusing on how their design prioritizes memory bandwidth over raw computational power for machine learning tasks. It details how GPUs manage thousands of thre…
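The bandwidth-vs-FLOPS point can be made with back-of-envelope arithmetic for single-token decoding: every weight must be read from memory once per token, while only about two FLOPs are done per weight. All hardware numbers below are illustrative round figures, not any specific GPU's spec sheet.

```python
# Roofline-style estimate for one decode step of a 7B-parameter fp16 model.
params = 7e9
bytes_per_param = 2            # fp16 weights
flops_per_token = 2 * params   # one multiply-add per parameter
bytes_per_token = params * bytes_per_param

peak_flops = 300e12            # ~300 TFLOP/s fp16 (illustrative GPU)
peak_bw = 1.0e12               # ~1 TB/s HBM bandwidth (illustrative)

t_compute = flops_per_token / peak_flops   # time if compute-bound
t_memory = bytes_per_token / peak_bw       # time if bandwidth-bound

print(f"compute-bound: {t_compute * 1e3:.3f} ms/token")
print(f"memory-bound:  {t_memory * 1e3:.3f} ms/token")
```

Under these assumptions the memory-bound time is roughly two orders of magnitude larger than the compute-bound time, which is why decoding throughput tracks memory bandwidth rather than peak FLOPS.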
-
AdapShot optimizes LLM in-context learning with dynamic shot counts and KV cache reuse
Researchers have introduced AdapShot, a novel approach to enhance many-shot in-context learning for large language models. This method dynamically adjusts the number of examples provided based on query difficulty, using…
-
New HERMES and DSCache methods improve streaming video understanding with KV cache
Researchers have developed new methods to improve the efficiency of multimodal large language models (MLLMs) for understanding streaming video. One approach, HERMES, conceptualizes the KV cache as a hierarchical memory …
-
Video Generation with Predictive Latents
Researchers have developed several new methods to improve the efficiency and quality of visual generative models. DC-DiT introduces dynamic chunking to Diffusion Transformers, adaptively compressing visual data for fast…
-
Google's Gemma 4 models achieve 3x speed boost with speculative decoding
Google has released Multi-Token Prediction (MTP) drafters for its Gemma 4 open models, which can increase inference speed by up to three times. This advancement utilizes a speculative decoding architecture, allowing a l…
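The draft-then-verify control flow behind speculative decoding can be shown with toy deterministic "models" standing in for the drafter and the target; this is not Gemma or MTP, just the accept-longest-verified-prefix loop. In a real system the per-token verifications below are batched into a single target-model forward pass.

```python
def target_next(ctx):
    """Toy deterministic stand-in for the large target model."""
    return (sum(ctx) + 1) % 7

def draft_next(ctx):
    """Toy drafter that sometimes disagrees with the target."""
    return (sum(ctx) + 1) % 7 if len(ctx) % 3 else 0

def speculate(ctx, n_draft=4):
    """Draft n tokens cheaply, then keep the longest target-verified prefix."""
    draft = []
    for _ in range(n_draft):
        draft.append(draft_next(ctx + draft))
    accepted = []
    for tok in draft:
        if tok == target_next(ctx + accepted):
            accepted.append(tok)          # draft token verified: keep it
        else:
            accepted.append(target_next(ctx + accepted))  # correct and stop
            break
    return accepted

out = speculate([1, 2])
print(out)   # more than one token emitted per verification round
```

The speed-up comes from emitting several accepted tokens per round while guaranteeing the output matches what the target model alone would have produced.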
-
FluxMoE system decouples expert weights for faster LLM serving
Researchers have developed FluxMoE, a new system designed to improve the efficiency of serving Mixture-of-Experts (MoE) models. FluxMoE addresses the challenge of large parameter sizes in MoE models by decoupling expert…
-
New theory unifies KV cache eviction for LLMs, improving long-context generation
Researchers have developed a new method for managing KV cache eviction in large language models, drawing inspiration from the Information Bottleneck principle. This approach, named CapKV, aims to preserve the most predi…
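Score-based KV eviction in general can be sketched in a few lines: when the cache exceeds a budget, keep the entries judged most useful and drop the rest. The importance score below (cumulative attention mass, here random) is a generic placeholder, not CapKV's Information Bottleneck criterion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generic score-based KV eviction sketch (NOT CapKV's actual criterion).
budget = 4                              # max tokens to keep in cache
keys = rng.normal(size=(10, 8))         # 10 cached key vectors
values = rng.normal(size=(10, 8))
attn_mass = rng.random(10)              # placeholder importance scores

# Keep the `budget` highest-scoring tokens, preserving temporal order.
keep = np.sort(np.argsort(attn_mass)[-budget:])
keys, values = keys[keep], values[keep]
print(keys.shape)
```

Eviction methods differ mainly in how that score is computed; the trimming step itself is the same cheap index-select.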
-
Kwai Summary Attention compresses historical contexts for efficient long-context LLMs
Researchers have introduced Kwai Summary Attention (KSA), a novel attention mechanism designed to address the quadratic time complexity of standard softmax attention in large language models. KSA aims to maintain a line…
-
New research explores efficient LLM inference through sparse caching, batching, and secure computation
Multiple research papers are exploring novel techniques to enhance the efficiency and performance of Large Language Model (LLM) inference and training. These advancements include queueing-theoretic frameworks for stabil…
-
New architectures and frameworks target LLM serving bottlenecks for long contexts
Researchers have developed novel architectures and techniques to address the escalating latency and energy consumption challenges in serving large language models (LLMs) with long contexts. One approach, AMMA, proposes …
-
LLM inference speed-ups explained with KV cache coding tutorials
The KV cache is a crucial technique for optimizing the inference speed of Large Language Models (LLMs) in production environments. It works by storing and reusing intermediate key and value computations, thereby avoidin…
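The store-and-reuse mechanism described here fits in a few lines of NumPy: at each decode step, key/value projections are computed only for the new token and appended to the cache, while attention runs over all cached entries. Dimensions and weights below are toy values (single head, no batching).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # toy head dimension
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []                # grows by one row per token

def decode_step(x):
    """Attend the new token's query over all cached keys/values."""
    k_cache.append(x @ W_k)              # K/V computed once, then reused
    v_cache.append(x @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    q = x @ W_q
    scores = K @ q / np.sqrt(d)          # (t,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                   # (d,) attention output

for _ in range(5):                       # simulate 5 decode steps
    out = decode_step(rng.normal(size=d))

print(len(k_cache))                      # 5 cached keys, none recomputed
```

Without the cache, every step would recompute K and V for the entire prefix, turning each token's cost from O(t) into O(t²)-per-sequence work; the trade-off is cache memory that grows linearly with context length.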
-
Transformer consciousness: Speculative notes explore AI experience and attention mechanics
A speculative essay explores the potential for consciousness within Transformer models, suggesting that the experience of generating text (decode) is identical to the process of feeding text in (prefill). This perspecti…