Researchers reveal new methods for improving LLM inference speed and efficiency

By the PulseAugur editorial team · [26 sources] · 2023-12-05 00:00

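Google Research has introduced "speculative cascades," a novel approach that improves the efficiency of large language models (LLMs) by combining speculative decoding with standard cascades. The hybrid approach aims to reduce computational cost and inference latency without compromising output quality. By strategically using a smaller model to predict tokens and then verifying them with a larger model, speculative cascades offer a better cost-quality trade-off than either technique used alone, as demonstrated with Gemma and T5 models.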
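The draft-then-verify step described above can be illustrated with a toy sketch of greedy speculative decoding. This is not Google's speculative cascades implementation: it omits the cascade deferral rule entirely, and the names `speculative_step`, `generate`, `toy_draft`, and `toy_target` are hypothetical stand-ins for real models.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens it produces."""
    # 1. The small model drafts k tokens autoregressively.
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # 2. The large model checks each drafted position. A real system would
    #    score all k positions in one batched forward pass; a loop is used
    #    here only for clarity.
    accepted: List[Token] = []
    for tok in drafted:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # fix the first mismatch, then stop
            return accepted
        accepted.append(tok)

    # 3. Every draft was accepted, so the large model adds one more token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prompt: List[Token], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prompt) + max_new]


def greedy(prompt: List[Token], model: NextToken, max_new: int) -> List[Token]:
    """Plain one-token-at-a-time decoding, used as the reference output."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


# Toy stand-ins (assumptions): the "large" model counts upward modulo 10,
# and the "small" model agrees with it except right after the token 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

With greedy verification, `generate([0], toy_draft, toy_target, k=3, max_new=8)` returns the same sequence as `greedy([0], toy_target, 8)`. The potential saving comes from needing fewer rounds of the large model whenever the small model's drafts are accepted.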

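Impact: New inference techniques such as speculative cascades and KV cache compression can significantly reduce the operating cost of LLM deployments.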

Ranking rationale: This cluster contains academic papers detailing new methods for improving LLM inference efficiency.

AI-generated summary · Google Gemini · from 26 sources.

Sources [26]

Google AI / Research TIER_1 English(EN) · 2025-09-11 22:01

Speculative cascades — A hybrid approach for smarter, faster LLM inference

Generative AI
Hugging Face Blog TIER_1 English(EN) · 2025-03-28 00:00

🚀 Accelerating LLM Inference with TGI on Intel Gaudi
Hugging Face Blog TIER_1 English(EN) · 2023-12-05 00:00

Optimum-NVIDIA Unlocking blazingly fast LLM inference in just 1 line of code
arXiv cs.AI TIER_1 English(EN) · Sanjeev Rao Ganjihal · 2026-05-01 04:00

Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference

arXiv:2604.26968v1 Announce Type: cross Abstract: Key-value (KV) cache memory management is the primary bottleneck limiting throughput and cost-efficiency in large-scale GPU inference serving. Current systems suffer from three compounding inefficiencies: (1) the absence of unifie…
arXiv cs.LG TIER_1 English(EN) · Aditya Ukarande, Deep Shekhar, Marc Blackstein, Ram Rangan · 2026-04-30 04:00

Efficient, VRAM-Constrained xLM Inference on Clients

arXiv:2604.26334v1 Announce Type: cross Abstract: To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on c…
arXiv cs.AI TIER_1 English(EN) · Bodon Jeong, Hongsu Byun, Youngjae Kim, Weikuan Yu, Kyungkeun Lee, Jihoon Yang, Sungyong Park · 2026-04-30 04:00

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

arXiv:2604.26557v1 Announce Type: cross Abstract: The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which often exceed available device me…
arXiv cs.AI TIER_1 English(EN) · Sungyong Park · 2026-04-29 11:44

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which often exceed available device memory. Although NVMe-based offloading offers scalab…
arXiv cs.LG TIER_1 English(EN) · Ram Rangan · 2026-04-29 06:35

Efficient, VRAM-Constrained xLM Inference on Clients

To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on client systems. To address this, we present pipelin…
arXiv cs.LG TIER_1 English(EN) · Nada Zine, Clément Quinton, Romain Rouvoy · 2026-04-29 04:00

Pimp My LLM: Leveraging Variability Modeling to Tune Inference Hyperparameters

arXiv:2602.17697v2 Announce Type: replace Abstract: Large Language Models (LLMs) are being increasingly used across a wide range of tasks. However, their substantial computational demands raise concerns about the energy efficiency and sustainability of both training and inference…
arXiv cs.CL TIER_1 English(EN) · Ishan Patel, Ishan Joshi · 2026-04-29 04:00

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference

arXiv:2604.24971v1 Announce Type: cross Abstract: We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent -- the standard paradigm -- PolyKV writes a co…
arXiv cs.CL TIER_1 English(EN) · Zahra Dehghanighobadi, Asja Fischer · 2026-04-28 04:00

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference

arXiv:2604.24647v1 Announce Type: new Abstract: Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on th…
arXiv cs.CL TIER_1 English(EN) · Marta Adamska, Daria Smirnova, Hamid Nasiri, Zhengxin Yu, Peter Garraghan · 2026-04-28 04:00

Green Prompting: Characterizing Prompt-driven Energy Costs of LLM Inference

arXiv:2503.10666v4 Announce Type: replace Abstract: Large Language Models (LLMs) have become widely used across various domains spanning search engines, code generation, and text creation. However, a major concern associated with their adoption is the high cost of inference, impa…
arXiv cs.CL TIER_1 English(EN) · Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, Min Zhang · 2026-04-28 04:00

LongFlow: Efficient KV Cache Compression for Reasoning Models

arXiv:2603.11504v2 Announce Type: replace-cross Abstract: Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer …
arXiv cs.LG TIER_1 English(EN) · Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jing Liu, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Cheng Li, Yuqing Yang, Fan Yang, Mao Yang · 2026-04-28 04:00

RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

arXiv:2505.02922v3 Announce Type: replace Abstract: Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags due to increasing GPU memory and bandwidth demands. This is because the key-value (KV) cache, an intermediate structu…
arXiv cs.CL TIER_1 English(EN) · Ishan Joshi · 2026-04-27 20:10

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference

We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent -- the standard paradigm -- PolyKV writes a compressed cache once and injects it into N independ…
arXiv cs.CL TIER_1 English(EN) · Asja Fischer · 2026-04-27 16:15

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference

Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint g…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-04-27 16:15

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference

Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint g…
arXiv cs.CL TIER_1 English(EN) · Dinghong Song, Jierui Xu, Weichu Yang, Pengfei Su, Dong Li · 2026-04-27 04:00

NeuronMLP: Efficient LLM Inference via Singular Value Decomposition Compression and Tiling on AWS Trainium

arXiv:2510.25977v4 Announce Type: replace Abstract: Emerging AI accelerators have started to gain attention and offer new opportunities for efficient inference of large language models (LLMs). Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides a…
arXiv cs.LG TIER_1 English(EN) · Anurita Das · 2026-04-27 04:00

MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

arXiv:2604.21026v2 Announce Type: replace Abstract: Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We introduce MCAP (Monte Carlo Activation Profiling), a load-time per-layer importance estimator that enables dynamic precisi…
X — Cohere TIER_1 English(EN) · cohere · 2026-04-22 20:38

Enjoyed the read? If you have deep experience in ML frameworks (training or inference) and love working on problems like these, our team is hiring! ML Systems Engineer, Frameworks & Tooling: https://t.co/IyMnsfplXv Audio Inference Engineer, Model Efficiency:
X — Cohere TIER_1 English(EN) · cohere · 2026-04-22 20:38

For real agentic workloads (North), short-context calibration wasn't enough. We calibrated AWQ on long internal agentic traces (up to 64k tokens) and added token masking in llm-compressor to exclude repetitive chat templates/tool descriptions from calibration stats. Plus QAD http…
X — Cohere TIER_1 English(EN) · cohere · 2026-04-22 20:38

🔧 The tricky part: naïvely casting BF16 group scales to FP8 dropped the quality. Our fix: quantize scales per-channel (outer vector scaling) + rescale by 1/8 to avoid FP8 clipping. Result: >99.5% of W4A16 accuracy recovered on Command A & Cohere MoE. Paired with a CUTLASS …
X — Cohere TIER_1 English(EN) · cohere · 2026-04-22 20:38

Excited to share our work on production-ready W4A8 inference, now integrated in vLLM! By combining 4-bit weights (low memory) with 8-bit activations (high compute), we hit the sweet spot for both decoding and prefill — up to 58% faster TTFT and 45% faster TPOT vs W4A16 on Hopper.…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-04-21 11:33

DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pressure, they often sacrifice generation quality and …
Mastodon — mastodon.social TIER_1 English(EN) · aihaberleri · 2026-04-29 19:44

📰 5 KV Cache Compression Techniques in 2026 That Slash LLM Memory Overhead by Up to 7.7x Top KV cache compression techniques are transforming LLM inference by reducing memory overhead through entropy coding, quantization, and rematerialization. These methods enable faster, cheape…
Mastodon — mastodon.social TIER_1 Türkçe(TR) · aihaberleri · 2026-04-29 19:44

📰 KV Cache Compression: 10 Proven Methods to Reduce LLM Memory Overflow in 2026. Ten innovative methods that address KV cache memory overflow, the biggest challenge for LLMs, combine entropy coding, low-rank decompositions, and rematerialization. These techniques bring model size to half …
