PulseAugur

KV cache

PulseAugur coverage of KV cache — every cluster mentioning KV cache across labs, papers, and developer communities, ranked by signal.

Total · 30d: 16 · 90d: 16
Releases · 30d: 0 · 90d: 0
Papers · 30d: 15 · 90d: 15
TIER MIX · 90D · RELATIONSHIPS
SENTIMENT · 30D: 1 day with sentiment data

RECENT · PAGE 1/1 · 15 TOTAL
  1. RESEARCH · CL_29321

    FibQuant method offers significant KV-cache compression for LLMs

    Researchers have developed FibQuant, a novel vector quantization method designed to significantly compress the key-value (KV) cache used in large language models. This technique aims to reduce the memory traffic associa…
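
    The cluster calls FibQuant a vector quantization method without spelling out its codebook construction, so the sketch below shows only the generic shape of vector-quantizing a KV cache: each cached vector is replaced by the uint8 index of its nearest k-means centroid. Every name, shape, and parameter here is an illustrative assumption, not FibQuant's recipe.

    ```python
    import numpy as np

    # Generic KV-cache vector-quantization sketch (NOT FibQuant's algorithm;
    # the source only says it is a vector quantization method). Each cached
    # vector is stored as the index of its nearest codebook centroid:
    # at d=64 that is 1 byte per entry instead of 128 bytes of fp16.

    def build_codebook(vectors, k=256, iters=10, seed=0):
        """Plain k-means over (n, d) sample vectors."""
        rng = np.random.default_rng(seed)
        centroids = vectors[rng.choice(len(vectors), k, replace=False)].copy()
        for _ in range(iters):
            d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            assign = d2.argmin(1)
            for j in range(k):
                members = vectors[assign == j]
                if len(members):
                    centroids[j] = members.mean(0)
        return centroids

    def compress(vectors, codebook):
        d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        return d2.argmin(1).astype(np.uint8)   # one code per cached vector

    def decompress(codes, codebook):
        return codebook[codes]                 # approximate values at read time

    keys = np.random.randn(2048, 64).astype(np.float32)   # toy cached keys
    cb = build_codebook(keys)
    approx = decompress(compress(keys, cb), cb)
    ```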

  2. TOOL · CL_24313

    Google's TurboQuant cuts LLM memory use by 6x with no accuracy loss

    Google researchers have developed a new technique called TurboQuant that significantly reduces the memory required by large language models. By employing a two-step process involving data rotation and scalar quantizatio…
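
    The two steps named above, a data rotation followed by scalar quantization, can be sketched generically; the rotation choice, bit width, and scaling below are assumptions rather than the published TurboQuant recipe, but they show why rotating first makes per-vector scalar quantization less lossy.

    ```python
    import numpy as np

    # Rotate-then-quantize sketch. A random orthogonal rotation spreads
    # outlier coordinates across all dimensions, so a single per-vector
    # scale wastes fewer bits; the exact TurboQuant transform is assumed.

    def random_rotation(d, seed=0):
        # QR of a Gaussian matrix yields a random orthogonal matrix
        q, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, d)))
        return q

    def quantize(x, bits=4):
        levels = 2 ** (bits - 1) - 1                       # e.g. ±7 for 4-bit
        scale = np.abs(x).max(axis=-1, keepdims=True) / levels
        return np.round(x / scale).astype(np.int8), scale

    def dequantize(codes, scale):
        return codes.astype(np.float32) * scale

    d = 128
    R = random_rotation(d)
    kv = np.random.standard_normal((1024, d)).astype(np.float32)
    codes, s = quantize(kv @ R, bits=4)     # store small ints + one scale/row
    approx = dequantize(codes, s) @ R.T     # undo the rotation at read time
    rel_err = np.linalg.norm(kv - approx) / np.linalg.norm(kv)
    ```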

  3. RESEARCH · CL_21864

    PyTorch struggles to match TensorFlow accuracy; quantization challenges persist

    A researcher found that reproducing a paper's results on the DermMNIST dataset using PyTorch yielded a 4% lower accuracy compared to the original TensorFlow implementation. This discrepancy is attributed to potential di…

  4. TOOL · CL_18041

    GPU hardware analysis reveals memory bandwidth, not FLOPS, is key for LLMs

    This article explains the fundamental architecture of GPUs, focusing on how their design prioritizes memory bandwidth over raw computational power for machine learning tasks. It details how GPUs manage thousands of thre…
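
    A back-of-envelope roofline check makes the bandwidth point concrete. The hardware figures below are rough A100 spec-sheet numbers (about 312 TFLOP/s dense fp16 and 2 TB/s of HBM bandwidth), not numbers taken from the article.

    ```python
    # Roofline sketch for single-stream LLM decoding on assumed A100-class
    # hardware: ~312 TFLOP/s dense fp16, ~2 TB/s HBM bandwidth.
    params = 7e9                       # 7B-parameter model
    bytes_per_param = 2                # fp16 weights
    flops_per_token = 2 * params       # one multiply-accumulate per weight

    t_compute = flops_per_token / 312e12           # compute-bound floor
    t_memory = params * bytes_per_param / 2e12     # every weight read per token

    print(f"compute floor: {t_compute * 1e3:.3f} ms/token")   # ~0.045 ms
    print(f"memory floor:  {t_memory * 1e3:.3f} ms/token")    # ~7 ms
    # Memory traffic dominates by roughly 150x, which is why decode speed
    # tracks bandwidth rather than FLOPS (and why shrinking the KV cache pays).
    ```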

  5. RESEARCH · CL_18309

    AdapShot optimizes LLM in-context learning with dynamic shot counts and KV cache reuse

    Researchers have introduced AdapShot, a novel approach to enhance many-shot in-context learning for large language models. This method dynamically adjusts the number of examples provided based on query difficulty, using…
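
    The summary gives neither AdapShot's difficulty metric nor its cache layout, but the KV-reuse half of the idea can be sketched: prefill each demonstration once, keep its KV segment, and build a query's cache by concatenating the first k segments instead of recomputing them. `difficulty`, the shot policy, and all shapes below are illustrative stand-ins.

    ```python
    import numpy as np

    # KV-segment reuse sketch for many-shot in-context learning. Each shot's
    # keys/values are computed once; per-query caches are assembled by
    # concatenation. (Only valid when shots form a shared, fixed-order prefix.)

    class ShotCache:
        def __init__(self):
            self.segments = []              # (K, V) per demonstration, in order

        def add_shot(self, K, V):
            self.segments.append((K, V))    # prefilled once, reused afterwards

        def cache_for(self, num_shots):
            ks, vs = zip(*self.segments[:num_shots])
            return np.concatenate(ks), np.concatenate(vs)

    def shots_for(difficulty, max_shots):
        # Assumed policy: harder queries get more in-context examples.
        return max(1, round(difficulty * max_shots))

    cache = ShotCache()
    for _ in range(32):                     # 32 toy demonstrations
        cache.add_shot(np.random.randn(16, 64), np.random.randn(16, 64))

    K, V = cache.cache_for(shots_for(difficulty=0.7, max_shots=32))
    ```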

  6. RESEARCH · CL_15670

    New HERMES and DSCache methods improve streaming video understanding with KV cache

    Researchers have developed new methods to improve the efficiency of multimodal large language models (MLLMs) for understanding streaming video. One approach, HERMES, conceptualizes the KV cache as a hierarchical memory …

  7. RESEARCH · CL_14344

    Video Generation with Predictive Latents

    Researchers have developed several new methods to improve the efficiency and quality of visual generative models. DC-DiT introduces dynamic chunking to Diffusion Transformers, adaptively compressing visual data for fast…

  8. SIGNIFICANT · CL_13509

    Google's Gemma 4 models achieve 3x speed boost with speculative decoding

    Google has released Multi-Token Prediction (MTP) drafters for its Gemma 4 open models, which can increase inference speed by up to three times. This advancement utilizes a speculative decoding architecture, allowing a l…
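
    As a minimal illustration of the draft-and-verify loop such drafters plug into (the generic greedy variant; `draft` and `verify` are toy stand-ins, not Google's MTP heads):

    ```python
    import numpy as np

    # Greedy speculative decoding sketch: a cheap drafter proposes k tokens,
    # the large model checks them all in one parallel pass, and the longest
    # prefix matching the large model's own argmax choices is accepted.

    VOCAB = 100

    def draft(ctx, k, rng):
        return [int(rng.integers(VOCAB)) for _ in range(k)]       # toy drafter

    def verify(ctx, proposed, rng):
        # A real implementation scores ctx .. ctx+k in one forward pass of
        # the big model; here its k+1 argmax picks are faked with randomness.
        return [int(rng.integers(VOCAB)) for _ in range(len(proposed) + 1)]

    def speculative_step(ctx, k, rng):
        proposed = draft(ctx, k, rng)
        target = verify(ctx, proposed, rng)
        accepted = []
        for p, t in zip(proposed, target):
            if p != t:                      # first disagreement: stop here,
                accepted.append(t)          # emitting the big model's token
                break
            accepted.append(p)
        else:
            accepted.append(target[-1])     # all k accepted: free bonus token
        return ctx + accepted               # 1..k+1 tokens per big-model pass

    rng = np.random.default_rng(0)
    ctx = speculative_step([1, 2, 3], k=4, rng=rng)
    ```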

  9. RESEARCH · CL_11925

    FluxMoE system decouples expert weights for faster LLM serving

    Researchers have developed FluxMoE, a new system designed to improve the efficiency of serving Mixture-of-Experts (MoE) models. FluxMoE addresses the challenge of large parameter sizes in MoE models by decoupling expert…

  10. RESEARCH · CL_10188

    New theory unifies KV cache eviction for LLMs, improving long-context generation

    Researchers have developed a new method for managing KV cache eviction in large language models, drawing inspiration from the Information Bottleneck principle. This approach, named CapKV, aims to preserve the most predi…
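
    CapKV's Information Bottleneck objective isn't detailed in the summary, so the sketch below substitutes a common stand-in score, accumulated attention mass, to show mechanically what budgeted eviction looks like.

    ```python
    import numpy as np

    # Budgeted KV-cache eviction sketch. The importance score here (summed
    # attention each entry has received) is an assumed stand-in, not CapKV's
    # Information Bottleneck criterion.

    def evict(K, V, attn_mass, budget):
        """K, V: (seq, d) cache; attn_mass: (seq,) attention received so far."""
        if len(K) <= budget:
            return K, V, attn_mass
        keep = np.argsort(attn_mass)[-budget:]   # top-`budget` entries
        keep.sort()                              # preserve positional order
        return K[keep], V[keep], attn_mass[keep]

    K, V = np.random.randn(1000, 64), np.random.randn(1000, 64)
    mass = np.random.rand(1000)                  # toy accumulated attention
    K, V, mass = evict(K, V, mass, budget=256)   # cache capped at 256 entries
    ```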

  11. RESEARCH · CL_06270

    Kwai Summary Attention compresses historical contexts for efficient long-context LLMs

    Researchers have introduced Kwai Summary Attention (KSA), a novel attention mechanism designed to address the quadratic time complexity of standard softmax attention in large language models. KSA aims to maintain a line…
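
    The mechanism description is cut off above, but the general shape of summarizing history for attention can be sketched: older keys and values are mean-pooled into one summary pair per chunk while a recent window stays exact, so cost grows with seq/chunk rather than seq. The chunking and pooling choices below are assumptions, not KSA's published design.

    ```python
    import numpy as np

    # Summary-attention sketch: exact attention over a recent window plus
    # mean-pooled "summary" KV pairs for older history. (KSA's actual
    # summarization scheme is not described in the cluster summary.)

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def pool_chunks(X, chunk):
        n = (len(X) // chunk) * chunk            # drop a ragged tail for brevity
        return X[:n].reshape(-1, chunk, X.shape[-1]).mean(1)

    def summary_attention(q, K, V, window=64, chunk=32):
        Ks = np.concatenate([pool_chunks(K[:-window], chunk), K[-window:]])
        Vs = np.concatenate([pool_chunks(V[:-window], chunk), V[-window:]])
        w = softmax(Ks @ q / np.sqrt(len(q)))    # (hist/chunk + window) scores
        return w @ Vs

    d = 64
    K, V = np.random.randn(1024, d), np.random.randn(1024, d)
    out = summary_attention(np.random.randn(d), K, V)
    ```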

  12. RESEARCH · CL_14463

    New research explores efficient LLM inference through sparse caching, batching, and secure computation

    Multiple research papers are exploring novel techniques to enhance the efficiency and performance of Large Language Model (LLM) inference and training. These advancements include queueing-theoretic frameworks for stabil…

  13. RESEARCH · CL_05008

    New architectures and frameworks target LLM serving bottlenecks for long contexts

    Researchers have developed novel architectures and techniques to address the escalating latency and energy consumption challenges in serving large language models (LLMs) with long contexts. One approach, AMMA, proposes …

  14. RESEARCH · CL_01025

    LLM inference speed-ups explained with KV cache coding tutorials

    The KV cache is a crucial technique for optimizing the inference speed of Large Language Models (LLMs) in production environments. It works by storing and reusing intermediate key and value computations, thereby avoidin…
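
    In the spirit of those tutorials, here is a minimal single-head decode step (weights and shapes are illustrative): the new token is projected once, its key and value are appended to the cache, and attention runs over the stored history instead of re-running the whole prefix.

    ```python
    import numpy as np

    # Single-head decode step with a KV cache: without the cache, every step
    # would recompute keys/values for the entire prefix; with it, each step
    # does one projection plus attention over stored history.

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def decode_step(x, Wq, Wk, Wv, cache):
        q, k, v = x @ Wq, x @ Wk, x @ Wv           # project only the new token
        cache["K"].append(k)                        # reuse everything cached
        cache["V"].append(v)
        K, V = np.stack(cache["K"]), np.stack(cache["V"])
        attn = softmax(q @ K.T / np.sqrt(len(q)))   # attend over full history
        return attn @ V

    d = 64
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.05 for _ in range(3))
    cache = {"K": [], "V": []}
    for _ in range(10):                             # ten cheap decode steps
        out = decode_step(rng.standard_normal(d), Wq, Wk, Wv, cache)
    ```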

  15. COMMENTARY · CL_04685

    Transformer consciousness: Speculative notes explore AI experience and attention mechanics

    A speculative essay explores the potential for consciousness within Transformer models, suggesting that the experience of generating text (decode) is identical to the process of feeding text in (prefill). This perspecti…