KV cache
PulseAugur coverage of KV cache — every cluster mentioning KV cache across labs, papers, and developer communities, ranked by signal.
- used by TurboQuant 90%
- used by Oscar 90%
- developed by Oscar 90%
- used by Rope 80%
- used by vLLM 70%
- developed by vLLM 70%
- used by graphics processing unit 70%
- used by large-language models 70%
- used by LongBench: a bilingual, multitask benchmark for long context understanding 70%
- used by Math-500 70%
- developed TurboQuant 70%
- used by speculative decoding 70%
- Engram pioneers AI 'memory' by baking knowledge into weights, not just context
AI startup Engram is developing a novel approach to AI memory and continual learning, aiming to embed specialized knowledge directly into model weights rather than relying solely on retrieval-augmented generation (RAG) …
- New EpiKV method optimizes LLM KV cache, boosting efficiency and context length
A new research paper introduces EpiKV, a method for optimizing KV cache eviction in large language models. Unlike previous methods that rely on attention weights, EpiKV uses an "epiphany score" derived from changes in t…
- ASAP framework enhances ML hyperparameter optimization via agent-system co-design
Researchers have developed ASAP, a novel agent-system co-design framework for hyperparameter optimization (HPO) in machine learning experiments. ASAP addresses limitations of existing HPO tools by integrating a diverse …
- Nexus Sampling improves LLM KV cache eviction, reducing memory use
Researchers have developed Nexus Sampling, a novel method for managing KV cache eviction in large language models, particularly for long-context and agentic workloads. This training-free approach pairs Nexus scoring wit…
- Kamera method enhances multimodal AI efficiency with position-invariant KV cache
Researchers have developed a new method called Kamera that addresses the inefficiency of multimodal AI agents re-encoding information from repeated video frames or UI screenshots. This technique introduces a training-fr…
- New methods enhance LLM efficiency via KV cache compression and quantization
Researchers have developed new methods to improve the efficiency of large language models (LLMs) by compressing their key-value (KV) caches. One approach, InfoKV, uses information-theoretic signals like predictive uncer…
- Keyless Attention mechanism halves KV cache and boosts transformer efficiency
Researchers have introduced Keyless Attention, a novel attention mechanism for transformers that eliminates the key projection entirely, operating solely on queries and values. This approach results in a Value-Only Cach…
- KV cache memory problem plagues LLM serving, vLLM's PagedAttention offers solution
The KV cache is a critical component in LLM inference, storing past computations to avoid recomputing them for each new token. However, its memory footprint can become a significant bottleneck, especially in production …
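The memory footprint mentioned in this summary can be estimated with simple arithmetic. What follows is a minimal sketch, assuming a standard transformer that stores one key vector and one value vector per token for every layer and every KV head. The function name and the model dimensions are hypothetical, chosen only for illustration, and are not taken from the linked coverage or from vLLM.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    Assumes one key and one value vector (hence the factor of 2) per token,
    per layer, per KV head, stored at bytes_per_elem bytes per element
    (2 bytes corresponds to fp16 or bf16).
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical model shape, for illustration only.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.1f} GiB")  # prints "2.0 GiB"
```

Because the result scales linearly with both sequence length and batch size, long contexts and many concurrent requests multiply the cost, which is why serving systems treat the cache as a resource to be managed.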
- Baidu releases Unlimited OCR with constant KV cache for long documents
Baidu has released Unlimited OCR, a 3-billion-parameter Mixture-of-Experts model designed for efficient long-document parsing. The model utilizes Reference Sliding Window Attention (R-SWA) to maintain a constant KV cach…
- AWS SageMaker enhances AI inference monitoring with CloudWatch dashboard
Amazon SageMaker has enhanced its monitoring capabilities for generative AI inference endpoints by integrating detailed metrics and a new Insights dashboard within Amazon CloudWatch. This upgrade allows users to more ef…
- New 'Execution-State Capsules' Speed Up On-Device AI Serving
Researchers have introduced "execution-state capsules," a novel method for managing and reusing the complete state of AI models during on-device serving. This approach allows for rapid checkpointing and restoration of a…
- New research enables editable and composable KV cache for LLMs
A new research paper introduces a novel method for optimizing KV cache usage in large language models, enabling editable and composable notes within the prefill stage. This approach allows for efficient editing of model…
- New methods boost LLM inference speed via speculative decoding (7 sources tracked)
Researchers are developing advanced speculative decoding techniques to accelerate large language model (LLM) inference. JetFlow, a new framework, improves speed by combining drafting efficiency with causal conditioning,…
- CogGuard framework offers proactive warnings for edge AI services
Researchers have developed CogGuard, a new framework designed to provide proactive warnings for edge intelligent services. This system aims to predict task completion success while adhering to strict latency and privacy…
- Variable-Width Transformers Offer Improved Efficiency in Language Models
Researchers have proposed a novel transformer architecture, termed the '> <former' or 'x-shaped' architecture, that deviates from the standard uniform width across all layers. This new design allocates wider capacity to…
- LLM Architectures Prioritize Long-Context Efficiency
New large language model architectures are focusing on improving efficiency with long contexts. Recent open-weight model releases are implementing architectural modifications to decrease the size of the KV cache, which …
- KVEraser offers efficient KV cache editing for LLMs
Researchers have developed KVEraser, a novel method for efficiently erasing specific information from the KV cache of large language models. This technique addresses the challenge of localized context editing, where rem…
- New LLM KV Cache Compression Methods Tackle Safety and Efficiency
Researchers are developing new methods to compress the Key-Value (KV) cache in large language models (LLMs) to reduce memory usage and improve inference efficiency. AnchorKV focuses on safety by biasing token retention …
- AI agents could buy precomputed KV caches to save compute
Researchers propose a novel method to reduce AI agent computation by precomputing and selling Key-Value (KV) caches for documents. This approach aims to eliminate redundant prefill computations, which are the most compu…
- Qwen 3.6 35B model excels with KV cache in agentic tasks
A user on r/LocalLLaMA found that the Qwen 3.6 35B model significantly outperforms the 27B version, particularly in agentic tasks, when using KV cache. This user initially favored the 27B model for its perceived intelli…