Researchers unveil new methods to boost LLM inference speed and efficiency
By PulseAugur Editorial ·
Summary by gemini-2.5-flash-lite
from 26 sources
Google Research has introduced "speculative cascades," a novel method to enhance Large Language Model (LLM) efficiency by merging speculative decoding with standard cascades. This hybrid approach aims to reduce computational costs and inference latency without compromising output quality. By strategically using smaller models to predict tokens and then verifying them with larger models, speculative cascades offer improved cost-quality trade-offs compared to either technique used in isolation, as demonstrated with Gemma and T5 models.
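The draft-and-verify loop at the heart of speculative decoding can be sketched as follows. The details here (greedy drafting, exact-match acceptance, toy deterministic "models") are simplifying assumptions for illustration, not Google's actual acceptance rule or cascade logic:

```python
# Toy stand-ins for a small draft model and a large target model.
# Each maps a context (tuple of token ids) to one "next token";
# real models would return probability distributions instead.
def draft_model(ctx):
    return (sum(ctx) * 31 + 7) % 50

def target_model(ctx):
    # Agrees with the draft model most of the time, diverges sometimes.
    t = (sum(ctx) * 31 + 7) % 50
    return t if sum(ctx) % 5 else (t + 1) % 50

def speculative_decode(prompt, n_tokens, k=4):
    """Greedy speculative decoding: draft k tokens with the cheap model,
    then keep the longest prefix the target model agrees with, plus one
    corrected token on the first disagreement."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1) Draft k tokens with the small model.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_model(tuple(ctx))
            draft.append(t)
            ctx.append(t)
        # 2) Verify: the large model checks each drafted position
        #    (one batched forward pass in a real system).
        accepted, ctx = 0, list(out)
        for t in draft:
            if target_model(tuple(ctx)) != t:
                break
            ctx.append(t)
            accepted += 1
        out.extend(draft[:accepted])
        if accepted < k:
            # First mismatch: emit the target model's own token instead.
            out.append(target_model(tuple(out)))
    return out[len(prompt):][:n_tokens]

print(speculative_decode([1, 2, 3], 8))
```

With exact-match acceptance plus correction, the output is identical to greedily decoding with the large model alone; the speedup comes from verifying several drafted tokens per large-model pass instead of one.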
AI
arXiv:2604.26968v1 Announce Type: cross Abstract: Key-value (KV) cache memory management is the primary bottleneck limiting throughput and cost-efficiency in large-scale GPU inference serving. Current systems suffer from three compounding inefficiencies: (1) the absence of unifie…
arXiv cs.LG
TIER_1·Aditya Ukarande, Deep Shekhar, Marc Blackstein, Ram Rangan·
arXiv:2604.26334v1 Announce Type: cross Abstract: To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on client systems. To address this, we present pipelin…
arXiv:2604.26557v1 Announce Type: cross Abstract: The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which often exceed available device memory. Although NVMe-based offloading offers scalab…
arXiv:2602.17697v2 Announce Type: replace Abstract: Large Language Models (LLMs) are being increasingly used across a wide range of tasks. However, their substantial computational demands raise concerns about the energy efficiency and sustainability of both training and inference…
arXiv:2604.24971v1 Announce Type: cross Abstract: We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent -- the standard paradigm -- PolyKV writes a compressed cache once and injects it into N independ…
arXiv cs.CL
TIER_1·Zahra Dehghanighobadi, Asja Fischer·
arXiv:2604.24647v1 Announce Type: new Abstract: Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint g…
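As a rough illustration of why the KV cache becomes the memory bottleneck at long context: its size grows linearly with sequence length, since keys and values are stored for every layer, head, and position. The model dimensions below are assumed for illustration (32 layers, 32 KV heads, head dim 128, FP16), not taken from any of the papers above:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2):  # 2 bytes = FP16/BF16
    """Per-sequence KV cache size: keys + values (factor of 2) for
    every layer, KV head, and token position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# The cache grows linearly with context length:
for ctx in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:5.1f} GiB")
```

At these assumed dimensions a 128K-token context needs tens of GiB of cache per sequence, which is why compression, offloading, and sharing schemes like the ones above exist.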
arXiv:2503.10666v4 Announce Type: replace Abstract: Large Language Models (LLMs) have become widely used across various domains spanning search engines, code generation, and text creation. However, a major concern associated with their adoption is the high cost of inference, impa…
arXiv cs.CL
TIER_1·Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, Min Zhang·
arXiv:2603.11504v2 Announce Type: replace-cross Abstract: Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer …
arXiv:2505.02922v3 Announce Type: replace Abstract: Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags due to increasing GPU memory and bandwidth demands. This is because the key-value (KV) cache, an intermediate structu…
arXiv:2510.25977v4 Announce Type: replace Abstract: Emerging AI accelerators have started to gain attention and offer new opportunities for efficient inference of large language models (LLMs). Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides a…
arXiv:2604.21026v2 Announce Type: replace Abstract: Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We introduce MCAP (Monte Carlo Activation Profiling), a load-time per-layer importance estimator that enables dynamic precisi…
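The MCAP abstract is truncated, but the general idea of a load-time, sampling-based per-layer importance estimate driving precision assignment can be sketched as follows. Everything below (the activation-magnitude importance proxy, the two-tier bit assignment, the toy layers) is an illustrative assumption, not MCAP's actual algorithm:

```python
import random

random.seed(0)

def profile_layer_importance(layer_fns, n_samples=64, dim=16):
    """Monte Carlo profiling: push random inputs through each layer and
    use mean absolute activation as a crude importance proxy."""
    scores = []
    for fn in layer_fns:
        total = 0.0
        for _ in range(n_samples):
            x = [random.gauss(0, 1) for _ in range(dim)]
            y = fn(x)
            total += sum(abs(v) for v in y) / len(y)
        scores.append(total / n_samples)
    return scores

def assign_precision(scores, high_bits=8, low_bits=4, keep_frac=0.5):
    """Give the top keep_frac most 'important' layers more bits."""
    k = max(1, int(len(scores) * keep_frac))
    top = set(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])
    return [high_bits if i in top else low_bits for i in range(len(scores))]

# Toy layers: element-wise scaling by different gains.
layers = [lambda x, g=g: [g * v for v in x] for g in (0.1, 2.0, 0.5, 3.0)]
scores = profile_layer_importance(layers)
print(assign_precision(scores))  # layers with larger activations get 8 bits
```

The appeal of a load-time estimator is that no offline calibration dataset is needed; a few random forward passes at model load decide the per-layer precision map.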
Enjoyed the read? If you have deep experience in ML frameworks (training or inference) and love working on problems like these, our team is hiring!
ML Systems Engineer, Frameworks & Tooling: https://t.co/IyMnsfplXv
Audio Inference Engineer, Model Efficiency:
For real agentic workloads (North), short-context calibration wasn't enough. We calibrated AWQ on long internal agentic traces (up to 64k tokens) and added token masking in llm-compressor to exclude repetitive chat templates/tool descriptions from calibration stats. Plus QAD http…
🔧 The tricky part: naïvely casting BF16 group scales to FP8 degraded quality. Our fix: quantize scales per-channel (outer vector scaling) + rescale by 1/8 to avoid FP8 clipping. Result: >99.5% of W4A16 accuracy recovered on Command A & Cohere MoE. Paired with a CUTLASS …
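The clipping issue is easy to show numerically: FP8 E4M3 represents magnitudes only up to 448, so large group scales saturate when cast directly, while pre-dividing by a power of two (folded back elsewhere in the kernel) keeps them in range. The snippet below models only the dynamic-range clamp, not FP8's reduced mantissa precision, and the scale values are invented for illustration:

```python
E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3

def to_fp8_clamped(x):
    """Crude FP8 cast: model only the saturation, not the rounding."""
    return max(-E4M3_MAX, min(E4M3_MAX, x))

# Hypothetical BF16 group scales, some larger than FP8 can hold.
scales = [12.5, 300.0, 1800.0, 2500.0]

naive = [to_fp8_clamped(s) for s in scales]             # big scales clip
rescaled = [to_fp8_clamped(s / 8) * 8 for s in scales]  # 1/8 pre-scale

print("naive:   ", naive)     # 1800 and 2500 saturate at 448
print("rescaled:", rescaled)  # all values survive the round trip
```

Dividing by 8 (an exact power of two) shifts only the exponent, so the trick costs no mantissa precision on the values that already fit.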
Excited to share our work on production-ready W4A8 inference, now integrated in vLLM! By combining 4-bit weights (low memory) with 8-bit activations (high compute), we hit the sweet spot for both decoding and prefill — up to 58% faster TTFT and 45% faster TPOT vs W4A16 on Hopper.…
The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pressure, they often sacrifice generation quality and …
📰 5 KV Cache Compression Techniques in 2026 That Slash LLM Memory Overhead by Up to 7.7x Top KV cache compression techniques are transforming LLM inference by reducing memory overhead through entropy coding, quantization, and rematerialization. These methods enable faster, cheape…
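Of the techniques the headline names, quantization is the simplest to show concretely. A minimal sketch of per-vector absmax int8 quantization applied to a KV entry, with toy values rather than a production scheme:

```python
def quantize_int8(vec):
    """Per-vector absmax int8 quantization: store int8 values plus one
    float scale, cutting storage 4x relative to FP32."""
    scale = max(abs(v) for v in vec) / 127 or 1.0  # avoid zero scale
    q = [round(v / scale) for v in vec]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

kv = [0.03, -1.27, 0.5, 0.98]
q, s = quantize_int8(kv)
recon = dequantize(q, s)
err = max(abs(a - b) for a, b in zip(kv, recon))
print(q, round(err, 4))  # small reconstruction error, 4x less memory
```

Real KV-cache schemes layer further tricks on top (per-channel or per-group scales, entropy coding of the int8 stream, rematerializing evicted entries), but this round trip is the core of the memory savings.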
📰 KV Cache Compression: 10 Proven Methods That Cut LLM Memory Overhead in 2026. Ten innovative methods tackling KV cache memory overhead, one of the biggest challenges for LLMs, combine entropy coding, low-rank decompositions, and rematerialization. These techniques cut model size in half …