PulseAugur
research · [26 sources]

Researchers unveil new methods to boost LLM inference speed and efficiency

Google Research has introduced "speculative cascades," a novel method to enhance Large Language Model (LLM) efficiency by merging speculative decoding with standard cascades. This hybrid approach aims to reduce computational costs and inference latency without compromising output quality. By strategically using smaller models to draft tokens and then verifying them with larger models, speculative cascades offer improved cost-quality trade-offs compared to either technique used in isolation, as demonstrated with Gemma and T5 models.
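As a rough illustration of the draft-and-verify loop these methods share, here is a minimal Python sketch. `small_lm` and `large_lm` are hypothetical greedy next-token callables, not the models or APIs from the Google work, and a real system would verify a whole draft in a single batched forward pass of the large model rather than token by token as written here.

```python
# Toy draft-and-verify loop in the spirit of speculative decoding /
# speculative cascades. `small_lm` and `large_lm` are hypothetical
# callables that map a list of tokens to a greedy next token.

def speculative_generate(prompt, small_lm, large_lm, draft_len=4, max_new=64):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. The cheap model drafts a short block of candidate tokens.
        draft = []
        for _ in range(draft_len):
            draft.append(small_lm(tokens + draft))

        # 2. The large model checks the draft: accept tokens up to the first
        #    disagreement and substitute its own token there, so the output
        #    matches what the large model alone would have produced.
        for proposed in draft:
            target = large_lm(tokens)
            tokens.append(target)
            if target != proposed:
                break  # reject the rest of this draft block
    return tokens
```

The "cascades" half of speculative cascades replaces this strict token-match test with a deferral rule that can keep an acceptable small-model token even when it differs from the large model's choice, which is roughly where the extra cost-quality headroom comes from.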

Summary written by gemini-2.5-flash-lite from 26 sources. How we write summaries →

IMPACT New inference techniques like speculative cascades and KV cache compression could significantly reduce operational costs for LLM deployments.

RANK_REASON The cluster contains research papers detailing new methods for improving LLM inference efficiency.

Read on Hugging Face Blog →
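To put the IMPACT note above in perspective, a back-of-the-envelope estimate shows why the KV cache, rather than the model weights, is often the cost lever in long-context serving; the model shape and serving settings here are illustrative assumptions, not figures taken from the sources listed below.

```python
# Rough KV-cache size for a hypothetical 32-layer decoder with 32
# attention heads of dimension 128 (no grouped-query sharing), serving
# 16-bit keys and values. All numbers are illustrative.
layers, heads, head_dim = 32, 32, 128
bytes_per_elem = 2            # fp16 / bf16
seq_len, batch = 32_768, 8    # long context, modest concurrent batch

kv_bytes = 2 * layers * heads * head_dim * bytes_per_elem * seq_len * batch
print(f"KV cache: {kv_bytes / 2**30:.0f} GiB")   # -> 128 GiB at these settings
```

At that scale the cache, not the weights, decides how many accelerators a deployment needs, so even a 2-4x cache compression maps fairly directly onto serving cost.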

COVERAGE [26]

  1. Google AI / Research TIER_1 ·

    Speculative cascades — A hybrid approach for smarter, faster LLM inference

  2. Hugging Face Blog TIER_1 ·

    🚀 Accelerating LLM Inference with TGI on Intel Gaudi

  3. Hugging Face Blog TIER_1 ·

    Optimum-NVIDIA: Unlocking blazingly fast LLM inference in just 1 line of code

  4. arXiv cs.AI TIER_1 · Sanjeev Rao Ganjihal ·

    Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference

    arXiv:2604.26968v1 Announce Type: cross Abstract: Key-value (KV) cache memory management is the primary bottleneck limiting throughput and cost-efficiency in large-scale GPU inference serving. Current systems suffer from three compounding inefficiencies: (1) the absence of unifie…

  5. arXiv cs.LG TIER_1 · Aditya Ukarande, Deep Shekhar, Marc Blackstein, Ram Rangan ·

    Efficient, VRAM-Constrained xLM Inference on Clients

    arXiv:2604.26334v1 Announce Type: cross Abstract: To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on c…

  6. arXiv cs.AI TIER_1 · Bodon Jeong, Hongsu Byun, Youngjae Kim, Weikuan Yu, Kyungkeun Lee, Jihoon Yang, Sungyong Park ·

    DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

    arXiv:2604.26557v1 Announce Type: cross Abstract: The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which often exceed available device me…

  7. arXiv cs.AI TIER_1 · Sungyong Park ·

    DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

    The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which often exceed available device memory. Although NVMe-based offloading offers scalab…

  8. arXiv cs.LG TIER_1 · Ram Rangan ·

    Efficient, VRAM-Constrained xLM Inference on Clients

    To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on client systems. To address this, we present pipelin…

  9. arXiv cs.LG TIER_1 · Nada Zine, Clément Quinton, Romain Rouvoy ·

    Pimp My LLM: Leveraging Variability Modeling to Tune Inference Hyperparameters

    arXiv:2602.17697v2 Announce Type: replace Abstract: Large Language Models (LLMs) are being increasingly used across a wide range of tasks. However, their substantial computational demands raise concerns about the energy efficiency and sustainability of both training and inference…

  10. arXiv cs.CL TIER_1 · Ishan Patel, Ishan Joshi ·

    PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference

    arXiv:2604.24971v1 Announce Type: cross Abstract: We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent -- the standard paradigm -- PolyKV writes a co…

  11. arXiv cs.CL TIER_1 · Zahra Dehghanighobadi, Asja Fischer ·

    DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference

    arXiv:2604.24647v1 Announce Type: new Abstract: Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on th…

  12. arXiv cs.CL TIER_1 · Marta Adamska, Daria Smirnova, Hamid Nasiri, Zhengxin Yu, Peter Garraghan ·

    Green Prompting: Characterizing Prompt-driven Energy Costs of LLM Inference

    arXiv:2503.10666v4 Announce Type: replace Abstract: Large Language Models (LLMs) have become widely used across various domains spanning search engines, code generation, and text creation. However, a major concern associated with their adoption is the high cost of inference, impa…

  13. arXiv cs.CL TIER_1 · Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, Min Zhang ·

    LongFlow: Efficient KV Cache Compression for Reasoning Models

    arXiv:2603.11504v2 Announce Type: replace-cross Abstract: Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer …

  14. arXiv cs.LG TIER_1 · Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jing Liu, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Cheng Li, Yuqing Yang, Fan Yang, Mao Yang ·

    RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

    arXiv:2505.02922v3 Announce Type: replace Abstract: Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags due to increasing GPU memory and bandwidth demands. This is because the key-value (KV) cache, an intermediate structu…

  15. arXiv cs.CL TIER_1 · Ishan Joshi ·

    PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference

    We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent -- the standard paradigm -- PolyKV writes a compressed cache once and injects it into N independ… (a minimal sketch of this shared-pool layout follows the coverage list)

  16. arXiv cs.CL TIER_1 · Asja Fischer ·

    DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference

    Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint g…

  17. Hugging Face Daily Papers TIER_1 ·

    DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference

    Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint g…

  18. arXiv cs.CL TIER_1 · Dinghong Song, Jierui Xu, Weichu Yang, Pengfei Su, Dong Li ·

    NeuronMLP: Efficient LLM Inference via Singular Value Decomposition Compression and Tiling on AWS Trainium

    arXiv:2510.25977v4 Announce Type: replace Abstract: Emerging AI accelerators have started to gain attention and offer new opportunities for efficient inference of large language models (LLMs). Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides a…

  19. arXiv cs.LG TIER_1 · Anurita Das ·

    MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

    arXiv:2604.21026v2 Announce Type: replace Abstract: Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We introduce MCAP (Monte Carlo Activation Profiling), a load-time per-layer importance estimator that enables dynamic precisi…

  20. X — Cohere TIER_1 · cohere ·

    Enjoyed the read? If you have deep experience in ML frameworks (training or inference) and love working on problems like these, our team is hiring!

    Enjoyed the read? If you have deep experience in ML frameworks (training or inference) and love working on problems like these, our team is hiring! ML Systems Engineer, Frameworks & Tooling: https://t.co/IyMnsfplXv Audio Inference Engineer, Model Efficiency:

  21. X — Cohere TIER_1 · cohere ·

    For real agentic workloads (North), short-context calibration wasn't enough. We calibrated AWQ on long internal agentic traces (up to 64k tokens) and added toke

    For real agentic workloads (North), short-context calibration wasn't enough. We calibrated AWQ on long internal agentic traces (up to 64k tokens) and added token masking in llm-compressor to exclude repetitive chat templates/tool descriptions from calibration stats. Plus QAD http…

  22. X — Cohere TIER_1 · cohere ·

    🔧 The tricky part: naïvely casting BF16 group scales to FP8 dropped the quality. Our fix: quantize scales per-channel (outer vector scaling) + rescale by 1/8 to

    🔧 The tricky part: naïvely casting BF16 group scales to FP8 dropped the quality. Our fix: quantize scales per-channel (outer vector scaling) + rescale by 1/8 to avoid FP8 clipping. Result: >99.5% of W4A16 accuracy recovered on Command A & Cohere MoE. Paired with a CUTLASS … (a toy sketch of this scale decomposition follows the coverage list)

  23. X — Cohere TIER_1 · cohere ·

    Excited to share our work on production-ready W4A8 inference, now integrated in vLLM! By combining 4-bit weights (low memory) with 8-bit activations (high compu

    Excited to share our work on production-ready W4A8 inference, now integrated in vLLM! By combining 4-bit weights (low memory) with 8-bit activations (high compute), we hit the sweet spot for both decoding and prefill — up to 58% faster TTFT and 45% faster TPOT vs W4A16 on Hopper.…

  24. Hugging Face Daily Papers TIER_1 ·

    DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

    The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pressure, they often sacrifice generation quality and …

  25. Mastodon — mastodon.social TIER_1 · aihaberleri ·

    📰 5 KV Cache Compression Techniques in 2026 That Slash LLM Memory Overhead by Up to 7.7x Top KV cache compression techniques are transforming LLM inference by r

    📰 5 KV Cache Compression Techniques in 2026 That Slash LLM Memory Overhead by Up to 7.7x Top KV cache compression techniques are transforming LLM inference by reducing memory overhead through entropy coding, quantization, and rematerialization. These methods enable faster, cheape…

  26. Mastodon — mastodon.social TIER_1 Türkçe(TR) · aihaberleri ·

    📰 KV Cache Compression: 10 Proven Methods to Reduce LLM Memory Overflow in 2026 10 innovative methods to solve the KV cache memory overflow, the biggest challenge for LLMs

    📰 KV Cache Compression: 10 Proven Methods to Reduce LLM Memory Overflow in 2026. The 10 innovative methods that solve KV cache memory overflow, the biggest challenge for LLMs, combine entropy coding, low-rank decompositions, and rematerialization. These techniques cut the model size in half …
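A minimal sketch of the shared KV-cache pool described in the PolyKV entries (items 10 and 15), under assumed interfaces: the compressed cache for a shared prefix is written once, and every agent attaches to the same pooled bytes instead of holding its own copy. The `compress`/`decompress` callables and the pool class below are illustrative, not the paper's API.

```python
# Write-once / read-many KV-cache pool in the spirit of the PolyKV
# entries above. The compress/decompress callables and this interface
# are assumptions for illustration, not the paper's actual API.
from dataclasses import dataclass

@dataclass
class PooledEntry:
    blob: bytes          # compressed keys/values for a shared prefix
    num_tokens: int

class SharedKVPool:
    def __init__(self):
        self._entries = {}                    # prefix_id -> PooledEntry

    def put_once(self, prefix_id, kv_list, compress):
        # The cache for a shared prefix is compressed and stored a single
        # time, no matter how many agents will later read it.
        if prefix_id not in self._entries:
            self._entries[prefix_id] = PooledEntry(compress(kv_list), len(kv_list))

    def attach(self, prefix_id, decompress):
        # Each agent decompresses a view of the same stored bytes; the only
        # per-agent KV memory is for tokens that agent generates itself.
        entry = self._entries[prefix_id]
        return decompress(entry.blob)
```

Per-agent memory then grows only with the tokens each agent generates itself, while the shared context is stored a single time.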
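And a toy version of the scale-decomposition fix from the Cohere thread (item 22), again under stated assumptions: instead of casting BF16 group scales straight to FP8, where large values clip at the e4m3 maximum, one outer scale is factored out per output channel and 1/8 of headroom is kept below that maximum. The shapes, names, and exact headroom rule here are guesses for illustration, not Cohere's implementation.

```python
import numpy as np

# Toy scale decomposition: per-channel outer scale plus group scales kept
# safely inside FP8 (e4m3) range. Shapes and the headroom rule are assumed.
E4M3_MAX = 448.0   # largest finite e4m3 value

def decompose_group_scales(group_scales):        # [out_channels, n_groups], > 0
    # Per-channel outer scale chosen so the rescaled group scales peak at
    # E4M3_MAX / 8, comfortably inside FP8 range, so a later e4m3 cast
    # cannot clip them.
    outer = group_scales.max(axis=1, keepdims=True) * (8.0 / E4M3_MAX)
    outer = np.maximum(outer, 1e-12)             # guard all-zero rows
    inner = group_scales / outer                 # what would be stored as FP8
    assert float(inner.max()) <= E4M3_MAX        # nothing would clip in the cast
    return outer.astype(np.float32), inner

# Effective group scale at runtime is outer[c] * fp8(inner[c, g]).
```

The thread reports over 99.5% of W4A16 accuracy recovered on Command A and Cohere MoE with this approach; the sketch covers only the scale bookkeeping, not the matmul kernel it is paired with.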