New research explores efficient LLM inference through sparse caching, batching, and secure computation.
By PulseAugur Editorial·
Summary by gemini-2.5-flash-lite
from 29 sources
Multiple research papers are exploring novel techniques to enhance the efficiency and performance of Large Language Model (LLM) inference and training. These advances include queueing-theoretic frameworks for stability analysis, capacity-aware data mixture laws for optimization, and overhead-aware KV cache loading for on-device deployment. Other work covers secure inference over encrypted data, accelerating long-context inference with asymmetric hashing, and optimizing distributed training with dynamic sparse attention. Additionally, systems are being developed for serving under multiple service-level objectives (SLOs) and for fast scaling, alongside hardware accelerators that combine neural processing units (NPUs) with processing-in-memory (PIM) for edge LLM inference.
AI
IMPACT
These research efforts aim to significantly reduce the computational and memory costs associated with LLMs, potentially enabling wider deployment and more efficient use of resources.
RANK_REASON
This cluster consists of multiple arXiv preprints detailing research into LLM inference and training optimization techniques.
The recent growth of on-device Large Language Model (LLM) inference has driven significant interest in device-edge collaborative LLM inference. As a promising architecture, Speculative Decoding (SD) is increasingly adopted where a lightweight draft model rapidly generates candida…
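For context, a minimal sketch of the draft-then-verify loop that speculative decoding builds on (greedy variant): in practice the target model scores all drafted positions in a single forward pass, and the `draft_next`/`target_next` callables, the greedy acceptance rule, and the toy counter model below are assumptions made for illustration only.

```python
# Minimal sketch of draft-then-verify speculative decoding (greedy variant).
# `draft_next` and `target_next` stand in for a small draft model and the
# full target model; both map a token sequence to the next token id.

def speculative_step(prefix, draft_next, target_next, k=4):
    # 1) Draft model proposes k candidate tokens autoregressively.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) Target model checks each drafted position; keep the longest
    #    prefix where its greedy choice matches the draft.
    accepted = []
    ctx = list(prefix)
    for t in draft:
        if target_next(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)

    # 3) Target always contributes one token after the accepted prefix,
    #    so each step emits at least one new token.
    accepted.append(target_next(ctx))
    return accepted

# Toy usage: both "models" just echo a counter, so every draft is accepted.
if __name__ == "__main__":
    next_tok = lambda seq: (seq[-1] + 1) % 50_000 if seq else 0
    print(speculative_step([1, 2, 3], next_tok, next_tok))  # [4, 5, 6, 7, 8]
```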
Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, since no single choice performs best across all workloads. Profile-based simulators are the standard tool, yet they hardcode their …
arXiv:2605.05873v1 Announce Type: cross Abstract: Large language models often improve reasoning by sampling multiple outputs and aggregating their final answers, but precise and efficient control of error levels remains a challenging task. In particular, deciding when to stop sam…
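As a rough illustration of the problem this abstract describes, the sketch below samples answers one at a time and stops once the leading answer pulls a fixed vote margin ahead of the runner-up. The margin-based rule and the toy sampler are assumptions, not the calibrated error-control procedure the paper studies.

```python
from collections import Counter
import random

def sample_until_confident(sample_answer, max_samples=32, margin=3):
    """Draw answers one at a time; stop once the leading answer is
    `margin` votes ahead of the runner-up (a toy stopping rule)."""
    votes = Counter()
    for n in range(1, max_samples + 1):
        votes[sample_answer()] += 1
        top = votes.most_common(2)
        lead = top[0][1] - (top[1][1] if len(top) > 1 else 0)
        if lead >= margin:
            break
    return top[0][0], n

# Toy usage: a biased "model" that answers 42 seventy percent of the time.
if __name__ == "__main__":
    answer, n_used = sample_until_confident(
        lambda: 42 if random.random() < 0.7 else random.randint(0, 9))
    print(answer, n_used)
```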
arXiv:2605.06046v1 Announce Type: new Abstract: Auto-regressive token generation in large language models is memory-bound because it requires "attending to" key and value tensors (KV cache) of all previous tokens. Prior work aims to improve the efficiency of this decode process b…
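To make the memory-bound nature of decode concrete, here is a single-head decode step with a KV cache in NumPy: every generated token reads the entire cache of keys and values. The shapes and single-head setup are illustrative assumptions, not the paper's method.

```python
import numpy as np

# Single-head decode step with a KV cache. Each new token attends to the
# keys/values of *all* previous tokens, so the whole cache is read per
# generated token, which makes decode memory-bound rather than compute-bound.

def decode_step(q, k_cache, v_cache, k_new, v_new):
    k_cache = np.concatenate([k_cache, k_new[None, :]])   # (t+1, d)
    v_cache = np.concatenate([v_cache, v_new[None, :]])   # (t+1, d)
    scores = k_cache @ q / np.sqrt(q.shape[-1])           # (t+1,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ v_cache                                # (d,)
    return out, k_cache, v_cache

d = 64
k_cache = np.random.randn(10, d)   # 10 previously cached tokens
v_cache = np.random.randn(10, d)
out, k_cache, v_cache = decode_step(
    np.random.randn(d), k_cache, v_cache,
    np.random.randn(d), np.random.randn(d))
print(out.shape, k_cache.shape)    # (64,) (11, 64)
```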
arXiv:2605.05219v1 Announce Type: new Abstract: Prefix caching is a key latency optimization for autoregressive LLM serving, yet existing systems assume dense per-token key/value reuse. State-space models change the structure of the problem: a recurrent layer can resume from a si…
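A hedged sketch of why state-space models change prefix caching: instead of dense per-token key/value reuse, a cached prefix can be resumed from a single recurrent state. The dict-based store and the toy running-sum "scan" below are assumptions for illustration, not any particular system's design.

```python
# Illustrative prefix cache for a recurrent (state-space) layer: resuming a
# cached prefix needs only the single recurrent state left after that prefix,
# not one K/V entry per token.

class RecurrentPrefixCache:
    def __init__(self):
        self._states = {}                      # prefix tokens -> layer state

    def lookup(self, tokens):
        """Return (longest cached prefix length, its state or None)."""
        for n in range(len(tokens), 0, -1):
            state = self._states.get(tuple(tokens[:n]))
            if state is not None:
                return n, state
        return 0, None

    def store(self, tokens, state):
        self._states[tuple(tokens)] = state

# Usage: a toy "scan" whose state is just a running sum of token ids.
cache = RecurrentPrefixCache()
cache.store([1, 2, 3], state=6)
hit_len, state = cache.lookup([1, 2, 3, 4, 5])
for tok in [1, 2, 3, 4, 5][hit_len:]:          # only the uncached suffix runs
    state += tok
print(hit_len, state)                           # 3 15
```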
arXiv:2604.21231v2 Announce Type: replace-cross Abstract: Efficient inference for on-device Large Language Models (LLMs) remains challenging due to limited hardware resources and the high cost of the prefill stage, which processes the full input context to construct Key-Value (KV…
arXiv cs.LG
TIER_1·Chengyi Nie, Nian Si, Zijie Zhou·
arXiv:2605.04595v1 Announce Type: new Abstract: The rapid adoption of large language models (LLMs) has created significant challenges for efficient inference at scale. Unlike traditional workloads, LLM inference is constrained by both computation and the memory overhead of key-va…
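As a back-of-the-envelope view of that memory constraint, the sketch below checks whether a batch's KV cache fits a fixed HBM budget. The per-token size (a Llama-2-7B-like configuration in fp16) and the budget are assumptions used only to make the arithmetic concrete; they are not part of the paper's queueing analysis.

```python
# Admission check for batching under a KV-cache memory budget.
# Assumed sizes: 32 layers, 32 heads, head dim 128, fp16 (2 bytes), K and V.
BYTES_PER_TOKEN = 2 * 32 * 32 * 128 * 2        # = 524288 bytes, ~0.5 MiB/token

def admissible(batch_context_lens, hbm_budget_bytes):
    """True if the KV cache for all requests in the batch fits the budget."""
    kv_bytes = sum(n * BYTES_PER_TOKEN for n in batch_context_lens)
    return kv_bytes <= hbm_budget_bytes

# Example: 16 requests at 4k tokens each need ~32 GiB of KV cache alone.
print(admissible([4096] * 16, hbm_budget_bytes=40 * 2**30))
```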
arXiv:2603.08022v2 Announce Type: replace Abstract: A data mixture refers to how different data sources are combined to train large language models, and selecting an effective mixture is crucial for optimal downstream performance. Existing methods either conduct costly searches d…
arXiv:2605.00831v1 Announce Type: cross Abstract: The rise of million-token, agent-based applications has placed unprecedented demands on large language model (LLM) inference services. The long-running nature of these tasks increases their susceptibility to hardware and software …
arXiv:2410.09457v2 Announce Type: replace Abstract: Modern cryptographic methods for implementing privacy-preserving LLMs, such as homomorphic encryption (HE), require the LLMs to have a polynomial form. Forming such a representation is challenging because transformers include non-polynomial componen…
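Homomorphic encryption evaluates only additions and multiplications, which is why non-polynomial transformer components (GELU, softmax, LayerNorm) must be replaced. A minimal sketch of one such replacement is below: a least-squares polynomial fit to GELU on a bounded range. The degree and range are assumptions; this is not the paper's construction.

```python
import numpy as np

# Replace a non-polynomial activation with a polynomial so it can be
# evaluated under HE. Illustrative: degree-8 least-squares fit to GELU
# on [-4, 4]; degree and fitting range are assumptions.

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

xs = np.linspace(-4, 4, 2001)
coeffs = np.polyfit(xs, gelu(xs), deg=8)        # polynomial replacement
poly_gelu = np.poly1d(coeffs)

# Worst-case approximation error over the fitted range.
print(float(np.max(np.abs(poly_gelu(xs) - gelu(xs)))))
```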
arXiv:2510.13668v2 Announce Type: replace-cross Abstract: Large Language Model (LLM) inference has emerged as a fundamental paradigm, however, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing s…
arXiv cs.LG
TIER_1·Yuzong Chen, Chao Fang, Xilai Dai, Yuheng Wu, Thierry Tambe, Marian Verhelst, Mohamed S. Abdelfattah·
arXiv:2511.06838v4 Announce Type: replace-cross Abstract: The substantial memory bandwidth and computational demands of large language models (LLMs) present critical challenges for efficient inference. To tackle this, the literature has explored heterogeneous systems that combine…
arXiv:2604.19351v3 Announce Type: replace Abstract: The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pr…
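For scale, the sketch below estimates how attention-score FLOPs grow with context length and shows the naive top-k key selection that many sparse-attention and cache-compression methods build on. It is a generic illustration, not the asymmetric-hashing approach mentioned in the summary above.

```python
import numpy as np

# Attention scores form an (n x n) matrix, so score FLOPs grow quadratically
# with context length n. Many sparse methods score a query against all keys
# once and keep only the top-k; the selection below is a naive illustration.

def score_flops(n, d):
    return 2 * n * n * d                       # QK^T multiply-adds

print(score_flops(8_192, 128) / score_flops(1_024, 128))   # 64x for 8x tokens

d, n, k = 64, 1024, 32
q, keys = np.random.randn(d), np.random.randn(n, d)
scores = keys @ q
topk = np.argpartition(scores, -k)[-k:]        # indices of the k best keys
print(topk.shape)                              # (32,)
```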
arXiv:2505.11329v5 Announce Type: replace-cross Abstract: Distributed inference of large language models (LLMs) using tensor parallelism can introduce communication overheads of $20$% even over GPUs connected via NVLink, a high-speed GPU interconnect. Several techniques have been…
arXiv:2604.13847v2 Announce Type: replace Abstract: While sparse attention mitigates the computational bottleneck of long-context LLM training, its distributed training process exhibits extreme heterogeneity in both (1) sequence length and (2) sparsity sensitivity…
arXiv cs.AI
TIER_1·Zahra Yousefijamarani, Xinglu Wang, Qian Wang, Morgan Lindsay Heisler, Taha Shabani, Niloofar Gholipour, Parham Yassini, Hong Chang, Kan Chen, Qiantao Zhang, Xiaolong Bai, Jiannan Wang, Ying Xiong, Yong Zhang, Zhenan Fan·
arXiv:2508.15919v3 Announce Type: replace-cross Abstract: Large language model (LLM) serving faces the dual challenge of meeting strict user-specific service-level objectives (SLOs) while minimizing computational cost under dynamic, multi-task workloads. Existing approaches eithe…
As large language models span dense, mixture-of-experts, and state-space architectures and are deployed on heterogeneous accelerators under increasingly diverse multimodal workloads, optimizing inference energy has become as critical as optimizing latency and throughput. Existing…
Hacker News — AI stories ≥50 points
TIER_1·mitchwainer·
Inference efficiency has quietly become one of the most consequential bottlenecks in AI deployment. As agentic coding systems such as Claude Code, Codex, and Cursor scale from developer tools to infrastructure powering software development at large, the underlying inference en…
This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/model-quantization.html). For the full version with working code examples and related articles, visit the original post.…
Another inference engine? So TokenSpeed is trending on GitHub this week, billing itself as a "speed-of-light LLM inference engine." I clicked through expecting either a vLLM clone or another Rust rewrite of llama.cpp. I haven't run it in production yet — the repo is…
The number of LLM providers keeps growing and so does the confusion around pricing, availability and compatibility. OpenModels is an open-source project that brings structure to this landscape: a single registry where models, providers, and their relationships are documented, …
LightSeek Foundation has released TokenSpeed, an open-source LLM inference engine designed specifically for agentic AI workloads. The engine uses a C++ finite-state machine to enforce KV cache safety at compile time and outperformed TensorRT-LLM by around 9-11% on NVIDIA Blackwel…
📰 TokenSpeed 2026: Open-Source LLM Inference Engine Beats TensorRT-LLM in Agentic Workloads TokenSpeed, a new open-source LLM inference engine from the LightSeek Foundation, targets TensorRT-LLM-level performance for agentic coding systems. Designed to reduce latency and power co…
📰 TokenSpeed 2026: LightSeek Foundation Makes LLM Output Speed 60% More Efficient for Agentic Workloads ... LightSeek Foundation has released an open-source LLM inference engine called TokenSpeed to meet the demands of agentic systems. The technology delivers TensorRT-LLM-level…
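The TokenSpeed coverage above mentions a C++ finite-state machine that enforces KV cache safety at compile time. The Python sketch below only illustrates the underlying idea, a cache-block lifecycle in which out-of-order transitions are rejected; the states and transitions are assumptions for illustration, not TokenSpeed's actual design.

```python
# Illustrative KV-cache block lifecycle: a block moves through fixed states,
# and any out-of-order transition (e.g. freeing a block still being read)
# is rejected. States and transitions are assumed, not TokenSpeed's.

ALLOWED = {
    "free":      {"allocated"},
    "allocated": {"filling"},
    "filling":   {"ready"},
    "ready":     {"in_use", "free"},
    "in_use":    {"ready"},
}

class KVBlock:
    def __init__(self):
        self.state = "free"

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise RuntimeError(f"illegal KV block transition: "
                               f"{self.state} -> {new_state}")
        self.state = new_state

blk = KVBlock()
for s in ("allocated", "filling", "ready", "in_use", "ready", "free"):
    blk.transition(s)
print(blk.state)                    # free
# blk.transition("in_use")          # would raise: free -> in_use is illegal
```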