New research explores efficient LLM inference through sparse caching, batching, and secure computation.
By PulseAugur Editorial·
Summary by gemini-2.5-flash-lite
from 29 sources
Multiple research papers are exploring novel techniques to enhance the efficiency and performance of Large Language Model (LLM) inference and training. These advances include queueing-theoretic frameworks for stability analysis, capacity-aware data mixture laws for optimization, and overhead-aware KV cache loading for on-device deployment. Other work covers secure inference over encrypted data, accelerating long-context inference with asymmetric hashing, and optimizing distributed training with dynamic sparse attention. Additionally, systems are being developed for serving under multiple service-level objectives (SLOs) and for fast scaling, alongside hardware accelerators that combine neural processing units (NPUs) with processing-in-memory (PIM) for edge LLM inference.
AI
IMPACT
These research efforts aim to significantly reduce the computational and memory costs associated with LLMs, potentially enabling wider deployment and more efficient use of resources.
RANK_REASON
This cluster consists of multiple arXiv preprints detailing research into LLM inference and training optimization techniques.
The recent growth of on-device Large Language Model (LLM) inference has driven significant interest in device-edge collaborative LLM inference. As a promising architecture, Speculative Decoding (SD) is increasingly adopted where a lightweight draft model rapidly generates candida…
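For context, a minimal sketch of the draft-then-verify loop that speculative decoding builds on (greedy variant): in practice the target model scores all drafted positions in a single forward pass, and the `draft_next`/`target_next` callables, the greedy acceptance rule, and the toy counter model below are assumptions made for illustration only.

```python
# Minimal sketch of draft-then-verify speculative decoding (greedy variant).
# `draft_next` and `target_next` stand in for a small draft model and the
# full target model; both map a token sequence to the next token id.

def speculative_step(prefix, draft_next, target_next, k=4):
    # 1) Draft model proposes k candidate tokens autoregressively.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) Target model checks each drafted position; keep the longest
    #    prefix where its greedy choice matches the draft.
    accepted = []
    ctx = list(prefix)
    for t in draft:
        if target_next(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)

    # 3) Target always contributes one token after the accepted prefix,
    #    so each step emits at least one new token.
    accepted.append(target_next(ctx))
    return accepted

# Toy usage: both "models" just echo a counter, so every draft is accepted.
if __name__ == "__main__":
    next_tok = lambda seq: (seq[-1] + 1) % 50_000 if seq else 0
    print(speculative_step([1, 2, 3], next_tok, next_tok))  # [4, 5, 6, 7, 8]
```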
Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, since no single choice performs best across all workloads. Profile-based simulators are the standard tool, yet they hardcode their …
arXiv:2605.05873v1 Announce Type: cross Abstract: Large language models often improve reasoning by sampling multiple outputs and aggregating their final answers, but precise and efficient control of error levels remains a challenging task. In particular, deciding when to stop sam…
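As a rough illustration of the problem this abstract describes, the sketch below samples answers one at a time and stops once the leading answer pulls a fixed vote margin ahead of the runner-up. The margin-based rule and the toy sampler are assumptions, not the calibrated error-control procedure the paper studies.

```python
from collections import Counter
import random

def sample_until_confident(sample_answer, max_samples=32, margin=3):
    """Draw answers one at a time; stop once the leading answer is
    `margin` votes ahead of the runner-up (a toy stopping rule)."""
    votes = Counter()
    for n in range(1, max_samples + 1):
        votes[sample_answer()] += 1
        top = votes.most_common(2)
        lead = top[0][1] - (top[1][1] if len(top) > 1 else 0)
        if lead >= margin:
            break
    return top[0][0], n

# Toy usage: a biased "model" that answers 42 seventy percent of the time.
if __name__ == "__main__":
    answer, n_used = sample_until_confident(
        lambda: 42 if random.random() < 0.7 else random.randint(0, 9))
    print(answer, n_used)
```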
arXiv:2605.06046v1 Announce Type: new Abstract: Auto-regressive token generation in large language models is memory-bound because it requires "attending to" key and value tensors (KV cache) of all previous tokens. Prior work aims to improve the efficiency of this decode process b…
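To make the memory-bound nature of decode concrete, here is a single-head decode step with a KV cache in NumPy: every generated token reads the entire cache of keys and values. The shapes and single-head setup are illustrative assumptions, not the paper's method.

```python
import numpy as np

# Single-head decode step with a KV cache. Each new token attends to the
# keys/values of *all* previous tokens, so the whole cache is read per
# generated token, which makes decode memory-bound rather than compute-bound.

def decode_step(q, k_cache, v_cache, k_new, v_new):
    k_cache = np.concatenate([k_cache, k_new[None, :]])   # (t+1, d)
    v_cache = np.concatenate([v_cache, v_new[None, :]])   # (t+1, d)
    scores = k_cache @ q / np.sqrt(q.shape[-1])           # (t+1,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ v_cache                                # (d,)
    return out, k_cache, v_cache

d = 64
k_cache = np.random.randn(10, d)   # 10 previously cached tokens
v_cache = np.random.randn(10, d)
out, k_cache, v_cache = decode_step(
    np.random.randn(d), k_cache, v_cache,
    np.random.randn(d), np.random.randn(d))
print(out.shape, k_cache.shape)    # (64,) (11, 64)
```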
arXiv:2605.05219v1 Announce Type: new Abstract: Prefix caching is a key latency optimization for autoregressive LLM serving, yet existing systems assume dense per-token key/value reuse. State-space models change the structure of the problem: a recurrent layer can resume from a si…
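A hedged sketch of why state-space models change prefix caching: instead of dense per-token key/value reuse, a cached prefix can be resumed from a single recurrent state. The dict-based store and the toy running-sum "scan" below are assumptions for illustration, not any particular system's design.

```python
# Illustrative prefix cache for a recurrent (state-space) layer: resuming a
# cached prefix needs only the single recurrent state left after that prefix,
# not one K/V entry per token.

class RecurrentPrefixCache:
    def __init__(self):
        self._states = {}                      # prefix tokens -> layer state

    def lookup(self, tokens):
        """Return (longest cached prefix length, its state or None)."""
        for n in range(len(tokens), 0, -1):
            state = self._states.get(tuple(tokens[:n]))
            if state is not None:
                return n, state
        return 0, None

    def store(self, tokens, state):
        self._states[tuple(tokens)] = state

# Usage: a toy "scan" whose state is just a running sum of token ids.
cache = RecurrentPrefixCache()
cache.store([1, 2, 3], state=6)
hit_len, state = cache.lookup([1, 2, 3, 4, 5])
for tok in [1, 2, 3, 4, 5][hit_len:]:          # only the uncached suffix runs
    state += tok
print(hit_len, state)                           # 3 15
```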
arXiv:2604.21231v2 Announce Type: replace-cross Abstract: Efficient inference for on-device Large Language Models (LLMs) remains challenging due to limited hardware resources and the high cost of the prefill stage, which processes the full input context to construct Key-Value (KV…
arXiv cs.LG
TIER_1·Chengyi Nie, Nian Si, Zijie Zhou·
arXiv:2605.04595v1 Announce Type: new Abstract: The rapid adoption of large language models (LLMs) has created significant challenges for efficient inference at scale. Unlike traditional workloads, LLM inference is constrained by both computation and the memory overhead of key-va…
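As a back-of-the-envelope view of that memory constraint, the sketch below checks whether a batch's KV cache fits a fixed HBM budget. The per-token size (a Llama-2-7B-like configuration in fp16) and the budget are assumptions used only to make the arithmetic concrete; they are not part of the paper's queueing analysis.

```python
# Admission check for batching under a KV-cache memory budget.
# Assumed sizes: 32 layers, 32 heads, head dim 128, fp16 (2 bytes), K and V.
BYTES_PER_TOKEN = 2 * 32 * 32 * 128 * 2        # = 524288 bytes, ~0.5 MiB/token

def admissible(batch_context_lens, hbm_budget_bytes):
    """True if the KV cache for all requests in the batch fits the budget."""
    kv_bytes = sum(n * BYTES_PER_TOKEN for n in batch_context_lens)
    return kv_bytes <= hbm_budget_bytes

# Example: 16 requests at 4k tokens each need ~32 GiB of KV cache alone.
print(admissible([4096] * 16, hbm_budget_bytes=40 * 2**30))
```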
arXiv:2603.08022v2 Announce Type: replace Abstract: A data mixture refers to how different data sources are combined to train large language models, and selecting an effective mixture is crucial for optimal downstream performance. Existing methods either conduct costly searches d…
arXiv:2605.00831v1 Announce Type: cross Abstract: The rise of million-token, agent-based applications has placed unprecedented demands on large language model (LLM) inference services. The long-running nature of these tasks increases their susceptibility to hardware and software …
arXiv:2410.09457v2 Announce Type: replace Abstract: Modern cryptographic methods for implementing privacy-preserving LLMs, such as homomorphic encryption (HE), require the LLMs to have a polynomial form. Forming such a representation is challenging because transformers include non-polynomial componen…
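Homomorphic encryption evaluates only additions and multiplications, which is why non-polynomial transformer components (GELU, softmax, LayerNorm) must be replaced. A minimal sketch of one such replacement is below: a least-squares polynomial fit to GELU on a bounded range. The degree and range are assumptions; this is not the paper's construction.

```python
import numpy as np

# Replace a non-polynomial activation with a polynomial so it can be
# evaluated under HE. Illustrative: degree-8 least-squares fit to GELU
# on [-4, 4]; degree and fitting range are assumptions.

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

xs = np.linspace(-4, 4, 2001)
coeffs = np.polyfit(xs, gelu(xs), deg=8)        # polynomial replacement
poly_gelu = np.poly1d(coeffs)

# Worst-case approximation error over the fitted range.
print(float(np.max(np.abs(poly_gelu(xs) - gelu(xs)))))
```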
arXiv:2510.13668v2 Announce Type: replace-cross Abstract: Large Language Model (LLM) inference has emerged as a fundamental paradigm, however, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing s…
arXiv cs.LG
TIER_1·Yuzong Chen, Chao Fang, Xilai Dai, Yuheng Wu, Thierry Tambe, Marian Verhelst, Mohamed S. Abdelfattah·
arXiv:2511.06838v4 Announce Type: replace-cross Abstract: The substantial memory bandwidth and computational demands of large language models (LLMs) present critical challenges for efficient inference. To tackle this, the literature has explored heterogeneous systems that combine…
arXiv:2604.19351v3 Announce Type: replace Abstract: The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pr…
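For scale, the sketch below estimates how attention-score FLOPs grow with context length and shows the naive top-k key selection that many sparse-attention and cache-compression methods build on. It is a generic illustration, not the asymmetric-hashing approach mentioned in the summary above.

```python
import numpy as np

# Attention scores form an (n x n) matrix, so score FLOPs grow quadratically
# with context length n. Many sparse methods score a query against all keys
# once and keep only the top-k; the selection below is a naive illustration.

def score_flops(n, d):
    return 2 * n * n * d                       # QK^T multiply-adds

print(score_flops(8_192, 128) / score_flops(1_024, 128))   # 64x for 8x tokens

d, n, k = 64, 1024, 32
q, keys = np.random.randn(d), np.random.randn(n, d)
scores = keys @ q
topk = np.argpartition(scores, -k)[-k:]        # indices of the k best keys
print(topk.shape)                              # (32,)
```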
arXiv:2505.11329v5 Announce Type: replace-cross Abstract: Distributed inference of large language models (LLMs) using tensor parallelism can introduce communication overheads of $20$% even over GPUs connected via NVLink, a high-speed GPU interconnect. Several techniques have been…
arXiv:2604.13847v2 Announce Type: replace Abstract: While sparse attention mitigates the computational bottleneck of long-context LLM training, its distributed training process exhibits extreme heterogeneity in both (1) sequence length and (2) sparsity sensitivity…
arXiv cs.AI
TIER_1·Zahra Yousefijamarani, Xinglu Wang, Qian Wang, Morgan Lindsay Heisler, Taha Shabani, Niloofar Gholipour, Parham Yassini, Hong Chang, Kan Chen, Qiantao Zhang, Xiaolong Bai, Jiannan Wang, Ying Xiong, Yong Zhang, Zhenan Fan·
arXiv:2508.15919v3 Announce Type: replace-cross Abstract: Large language model (LLM) serving faces the dual challenge of meeting strict user-specific service-level objectives (SLOs) while minimizing computational cost under dynamic, multi-task workloads. Existing approaches eithe…
As large language models span dense, mixture-of-experts, and state-space architectures and are deployed on heterogeneous accelerators under increasingly diverse multimodal workloads, optimizing inference energy has become as critical as optimizing latency and throughput. Existing…
Hacker News — AI stories ≥50 points
TIER_1·mitchwainer·
Inference efficiency has quietly become one of the most consequential bottlenecks in AI deployment. As agentic coding systems such as Claude Code, Codex, and Cursor scale from developer tools to infrastructure powering software development at large, the underlying inference en…
This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/model-quantization.html). For the full version with working code examples and related articles, visit the original post.…
Another inference engine? So TokenSpeed is trending on GitHub this week, billing itself as a "speed-of-light LLM inference engine." I clicked through expecting either a vLLM clone or another Rust rewrite of llama.cpp. I haven't run it in production yet — the repo is…
The number of LLM providers keeps growing and so does the confusion around pricing, availability and compatibility. OpenModels is an open-source project that brings structure to this landscape: a single registry where models, providers, and their relationships are documented, …
LightSeek Foundation has released TokenSpeed, an open-source LLM inference engine designed specifically for agentic AI workloads. The engine uses a C++ finite-state machine to enforce KV cache safety at compile time and outperformed TensorRT-LLM by around 9-11% on NVIDIA Blackwel…
📰 TokenSpeed 2026: Open-Source LLM Inference Engine Beats TensorRT-LLM in Agentic Workloads TokenSpeed, a new open-source LLM inference engine from the LightSeek Foundation, targets TensorRT-LLM-level performance for agentic coding systems. Designed to reduce latency and power co…
📰 TokenSpeed 2026: LightSeek Foundation Makes LLM Output Speed 60% More Efficient for Agentic Workloads ... LightSeek Foundation has released an open-source LLM inference engine called TokenSpeed to meet the demands of agentic systems. The technology delivers TensorRT-LLM-level…
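The TokenSpeed coverage above mentions a C++ finite-state machine that enforces KV cache safety at compile time. The Python sketch below only illustrates the underlying idea, a cache-block lifecycle in which out-of-order transitions are rejected; the states and transitions are assumptions for illustration, not TokenSpeed's actual design.

```python
# Illustrative KV-cache block lifecycle: a block moves through fixed states,
# and any out-of-order transition (e.g. freeing a block still being read)
# is rejected. States and transitions are assumed, not TokenSpeed's.

ALLOWED = {
    "free":      {"allocated"},
    "allocated": {"filling"},
    "filling":   {"ready"},
    "ready":     {"in_use", "free"},
    "in_use":    {"ready"},
}

class KVBlock:
    def __init__(self):
        self.state = "free"

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise RuntimeError(f"illegal KV block transition: "
                               f"{self.state} -> {new_state}")
        self.state = new_state

blk = KVBlock()
for s in ("allocated", "filling", "ready", "in_use", "ready", "free"):
    blk.transition(s)
print(blk.state)                    # free
# blk.transition("in_use")          # would raise: free -> in_use is illegal
```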