English(EN)🚀 Accelerating LLM Inference with TGI on Intel Gaudi
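Researchers reveal new methods to improve LLM inference speed and efficiency
By the PulseAugur editorial team · [26 sources]
Google Research has introduced "speculative cascades", a novel approach that improves the efficiency of large language models (LLMs) by combining speculative decoding with standard cascades. The hybrid approach aims to reduce computational cost and inference latency without compromising output quality. By strategically using a smaller model to predict tokens and a larger model to verify them, speculative cascades offer a better cost-quality trade-off than either technique alone, as demonstrated on Gemma and T5 models.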
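The draft-then-verify idea in the summary above can be sketched in a few lines. This is a minimal illustration of plain greedy speculative decoding only, not Google's speculative-cascades deferral rule, which the summary does not spell out. The names `speculative_decode`, `greedy_decode`, `toy_draft`, and `toy_target` are hypothetical, and the "models" are toy functions over integer tokens.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: generate one token at a time with a single model."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(draft: Model, target: Model, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the small model drafts, the large model verifies."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1) The small model drafts up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2) The large model checks each drafted position. A real system scores
        #    all drafted positions in one batched forward pass; this toy calls
        #    the target once per position for clarity.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token and discard the rest.
                accepted.append(expected)
                break
        tokens.extend(accepted)
    return tokens


def toy_target(prefix: List[int]) -> int:
    """Hypothetical large model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical small model: agrees with the target except after token 4."""
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


result = speculative_decode(toy_draft, toy_target, [1], max_new_tokens=12)
baseline = greedy_decode(toy_target, [1], max_new_tokens=12)
```

With greedy verification the output matches what the large model would have produced on its own; any savings come from accepting several drafted tokens per verification step.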
AI
arXiv:2604.26968v1 Announce Type: cross Abstract: Key-value (KV) cache memory management is the primary bottleneck limiting throughput and cost-efficiency in large-scale GPU inference serving. Current systems suffer from three compounding inefficiencies: (1) the absence of unifie…
arXiv cs.LG
TIER_1 · English (EN) · Aditya Ukarande, Deep Shekhar, Marc Blackstein, Ram Rangan
arXiv:2604.26334v1 Announce Type: cross Abstract: To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on c…
arXiv:2604.26557v1 Announce Type: cross Abstract: The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which often exceed available device me…
The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which often exceed available device memory. Although NVMe-based offloading offers scalab…
To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on client systems. To address this, we present pipelin…
arXiv:2602.17697v2 Announce Type: replace Abstract: Large Language Models (LLMs) are being increasingly used across a wide range of tasks. However, their substantial computational demands raise concerns about the energy efficiency and sustainability of both training and inference…
arXiv:2604.24971v1 Announce Type: cross Abstract: We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent -- the standard paradigm -- PolyKV writes a co…
arXiv cs.CL
TIER_1 · English (EN) · Zahra Dehghanighobadi, Asja Fischer
arXiv:2604.24647v1 Announce Type: new Abstract: Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on th…
arXiv:2503.10666v4 Announce Type: replace Abstract: Large Language Models (LLMs) have become widely used across various domains spanning search engines, code generation, and text creation. However, a major concern associated with their adoption is the high cost of inference, impa…
arXiv cs.CL
TIER_1 · English (EN) · Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, Min Zhang
arXiv:2603.11504v2 Announce Type: replace-cross Abstract: Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer …
arXiv:2505.02922v3 Announce Type: replace Abstract: Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags due to increasing GPU memory and bandwidth demands. This is because the key-value (KV) cache, an intermediate structu…
We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent -- the standard paradigm -- PolyKV writes a compressed cache once and injects it into N independ…
Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint g…
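Several of the abstracts above point to the KV cache's memory footprint as the limiting factor for long contexts. As a rough back-of-the-envelope sketch, the size of an uncompressed cache can be estimated from the model shape. The function name `kv_cache_bytes` and the example configuration are assumptions for illustration and are not taken from any of the cited papers.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Estimate the size of an uncompressed KV cache in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to 16-bit formats such as FP16 or BF16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
short_ctx = kv_cache_bytes(32, 8, 128, seq_len=8_192)
long_ctx = kv_cache_bytes(32, 8, 128, seq_len=65_536)

print(f"8k tokens:  {short_ctx / 1024**3:.1f} GiB")
print(f"64k tokens: {long_ctx / 1024**3:.1f} GiB")
```

The estimate grows linearly with both sequence length and batch size, which is why compression, sharing, and offloading of the cache keep reappearing in the work listed here.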
arXiv:2510.25977v4 Announce Type: replace Abstract: Emerging AI accelerators have started to gain attention and offer new opportunities for efficient inference of large language models (LLMs). Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides a…
arXiv:2604.21026v2 Announce Type: replace Abstract: Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We introduce MCAP (Monte Carlo Activation Profiling), a load-time per-layer importance estimator that enables dynamic precisi…
Enjoyed the read? If you have deep experience in ML frameworks (training or inference) and love working on problems like these, our team is hiring!
ML Systems Engineer, Frameworks & Tooling: https://t.co/IyMnsfplXv
Audio Inference Engineer, Model Efficiency:
For real agentic workloads (North), short-context calibration wasn't enough. We calibrated AWQ on long internal agentic traces (up to 64k tokens) and added token masking in llm-compressor to exclude repetitive chat templates/tool descriptions from calibration stats. Plus QAD http…
🔧 The tricky part: naïvely casting BF16 group scales to FP8 dropped the quality. Our fix: quantize scales per-channel (outer vector scaling) + rescale by 1/8 to avoid FP8 clipping. Result: >99.5% of W4A16 accuracy recovered on Command A & Cohere MoE. Paired with a CUTLASS …
Excited to share our work on production-ready W4A8 inference, now integrated in vLLM! By combining 4-bit weights (low memory) with 8-bit activations (high compute), we hit the sweet spot for both decoding and prefill — up to 58% faster TTFT and 45% faster TPOT vs W4A16 on Hopper.…
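For readers unfamiliar with the "4-bit weights" and "group scales" mentioned in this thread, the following is a generic sketch of symmetric group-wise 4-bit weight quantization. It is not the authors' method, the vLLM kernel, or the llm-compressor API; the function names and the group size are hypothetical, and real implementations typically use much larger groups and packed integer storage.

```python
from typing import List, Tuple


def quantize_int4_groups(weights: List[float],
                         group_size: int = 4) -> Tuple[List[int], List[float]]:
    """Quantize weights to signed 4-bit codes with one scale per group."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to code 7; avoid dividing by zero.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            codes.append(max(-8, min(7, q)))  # clamp to the signed 4-bit range
    return codes, scales


def dequantize_int4_groups(codes: List[int], scales: List[float],
                           group_size: int = 4) -> List[float]:
    """Recover approximate weights by multiplying each code by its group scale."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]


weights = [0.1, -0.7, 0.35, 0.0, 2.0, -1.0, 0.5, 0.25]
codes, scales = quantize_int4_groups(weights)
restored = dequantize_int4_groups(codes, scales)
```

Each group stores its own scale in addition to the 4-bit codes, so the numeric format chosen for the scales also affects accuracy, which is the issue the thread describes when casting BF16 group scales to FP8.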
The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pressure, they often sacrifice generation quality and …
📰 5 KV Cache Compression Techniques in 2026 That Slash LLM Memory Overhead by Up to 7.7x Top KV cache compression techniques are transforming LLM inference by reducing memory overhead through entropy coding, quantization, and rematerialization. These methods enable faster, cheape…
📰 KV Cache Compression: 10 Proven Methods That Reduce LLM Memory Overhead in 2026 Ten innovative methods that tackle KV cache memory overhead, the biggest challenge for LLMs, combine entropy coding, low-rank decompositions, and rematerialization. These techniques bring model size down to half …