New research explores LLM security, efficiency, and training optimization
By PulseAugur Editorial·
Summary by gemini-2.5-flash-lite
from 32 sources
Researchers are developing novel methods to enhance the efficiency and security of Large Language Models (LLMs). One study, "Widening the Gap," shows that outlier injection can compromise LLM quantization, demonstrating that security risks extend even to advanced quantization techniques like AWQ and GPTQ. Concurrently, other studies optimize LLM inference through adaptive quantization (XFP), speculative decoding with device-edge collaboration (GELATO), and efficient KV cache management (SparKV, Feather, Dooly). Additionally, new frameworks are emerging for analyzing LLM inference stability (Queueing-Theoretic Framework) and for optimizing training data selection (CAMEL).
AI
IMPACT
Advancements in LLM quantization security, inference efficiency, and training data optimization are crucial for broader and more secure AI deployment.
RANK_REASON
Multiple arXiv papers published on LLM-related topics including security, quantization, inference optimization, and training.
LLM quantization has become essential for memory-efficient deployment. Recent work has shown that quantization schemes can pose critical security risks: an adversary may release a model that appears benign in full precision but exhibits malicious behavior once quantized by users.…
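The truncated abstract above doesn't show the attack itself, but the underlying hazard is easy to demonstrate with a toy example: under symmetric round-to-nearest quantization, the scale is set by the largest-magnitude weight, so a single injected outlier can collapse every other weight to zero after quantization while leaving the full-precision model nearly unchanged. The `quantize_int4` function and the injection below are illustrative assumptions, not the paper's method.

```python
import numpy as np

# symmetric round-to-nearest int4 quantization: the scale is set by the
# largest-magnitude weight, which is exactly what an outlier can abuse
def quantize_int4(w):
    scale = np.abs(w).max() / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=64)   # benign small weights

clean = quantize_int4(w)

# inject one large outlier: the scale grows ~50x, so almost every other
# weight now rounds to zero -- the full-precision model barely changes,
# but the quantized model's behavior does
w_attacked = w.copy()
w_attacked[0] = 5.0
poisoned = quantize_int4(w_attacked)

survivors_clean = int(np.count_nonzero(clean))
survivors_poisoned = int(np.count_nonzero(poisoned))
```

After the injection, only the outlier itself survives quantization; every benign weight is rounded away.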
We introduce XFP, a dynamic weight quantizer for LLM inference that inverts the conventional workflow: the operator specifies reconstruction quality floors on per-channel cosine similarity (one strict floor for attention and shared experts, one lazy floor for routed-expert MoE); …
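XFP's actual algorithm isn't visible in this excerpt; as a rough sketch of the stated interface — pick the cheapest per-channel bit-width that clears a cosine-similarity floor, with a strict floor for some channels and a lazier one for others — one might write the following. All names and thresholds here are assumptions for illustration.

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def quantize(w, bits):
    # symmetric uniform quantization at the given bit-width
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / hi
    return np.clip(np.round(w / scale), lo, hi) * scale

def pick_bits(channel, floor, candidates=(2, 3, 4, 6, 8)):
    # smallest candidate bit-width whose reconstruction clears the floor
    for b in candidates:
        if cos_sim(channel, quantize(channel, b)) >= floor:
            return b
    return candidates[-1]

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 256))   # 8 channels of a toy weight matrix
strict, lazy = 0.999, 0.98      # e.g. attention vs routed-expert floors
bits_strict = [pick_bits(row, strict) for row in W]
bits_lazy = [pick_bits(row, lazy) for row in W]
```

Because any bit-width that satisfies the strict floor also satisfies the lazy one, the lazy floor can only assign equal or fewer bits per channel.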
The recent growth of on-device Large Language Model (LLM) inference has driven significant interest in device-edge collaborative LLM inference. As a promising architecture, Speculative Decoding (SD) is increasingly adopted where a lightweight draft model rapidly generates candida…
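The abstract is cut off, but the speculative-decoding loop it refers to is standard: a cheap draft model proposes several tokens, and the target model verifies them, accepting the longest agreeing prefix. A greedy-verification toy with stand-in next-token functions (not real models) looks like:

```python
# greedy speculative decoding sketch; the two toy next-token functions
# stand in for the on-device draft model and the edge target model
def draft_next(ctx):
    # cheap approximation: always predicts last_token + 1 (mod 10)
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    # "ground truth": agrees with the draft except after token 3
    return 7 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

def speculative_step(ctx, k=4):
    # 1) draft k candidate tokens autoregressively with the cheap model
    c, draft = list(ctx), []
    for _ in range(k):
        t = draft_next(c)
        draft.append(t)
        c.append(t)
    # 2) verify: accept the longest agreeing prefix, then emit the
    #    target's own token (a real system batches this verification
    #    into a single target-model forward pass)
    out = list(ctx)
    for t in draft:
        if target_next(out) == t:
            out.append(t)
        else:
            out.append(target_next(out))
            break
    else:
        out.append(target_next(out))  # all accepted: free bonus token
    return out

seq = [0]
for _ in range(3):
    seq = speculative_step(seq)
# the output is identical to decoding greedily with the target alone,
# but the target only ran once per batch of drafted tokens
```

Greedy verification preserves the target model's output exactly; the speedup comes from verifying several drafted tokens per target pass.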
Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, since no single choice performs best across all workloads. Profile-based simulators are the standard tool, yet they hardcode their …
arXiv:2605.05873v1 Announce Type: cross Abstract: Large language models often improve reasoning by sampling multiple outputs and aggregating their final answers, but precise and efficient control of error levels remains a challenging task. In particular, deciding when to stop sam…
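The paper's stopping rule isn't visible in this truncated abstract; as a generic illustration of the problem it poses, here is a naive margin-based stopping rule for answer aggregation. The margin threshold, sample cap, and stand-in sampler are all made up for the sketch.

```python
import random

# margin-based adaptive stopping for answer aggregation: keep sampling
# until the leading answer is `margin` votes ahead, or a hard cap is hit
def sample_until_confident(sample_answer, margin=3, cap=25):
    counts = {}
    for n in range(1, cap + 1):
        a = sample_answer()
        counts[a] = counts.get(a, 0) + 1
        ranked = sorted(counts.values(), reverse=True)
        lead = ranked[0] - (ranked[1] if len(ranked) > 1 else 0)
        if lead >= margin:
            break
    return max(counts, key=counts.get), n

random.seed(0)
# stand-in sampler: the "model" answers "42" with probability 0.7
answer, n_samples = sample_until_confident(
    lambda: "42" if random.random() < 0.7 else "41")
```

The trade-off the abstract alludes to is visible even here: a larger margin lowers the error rate but raises the expected number of samples.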
arXiv:2605.05219v1 Announce Type: new Abstract: Prefix caching is a key latency optimization for autoregressive LLM serving, yet existing systems assume dense per-token key/value reuse. State-space models change the structure of the problem: a recurrent layer can resume from a si…
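The structural difference the abstract points to shows up even in a scalar toy: a recurrent (state-space) layer summarizes the entire prefix in one state, so a prefix cache needs only a single snapshot at the boundary rather than per-token keys and values. The linear recurrence below is illustrative, not any specific SSM.

```python
# toy linear recurrence standing in for a state-space layer:
# h_t = a*h_{t-1} + b*x_t, so the whole prefix is summarized by h
def ssm_scan(xs, h0, a=0.9, b=0.5):
    h, outs = h0, []
    for x in xs:
        h = a * h + b * x
        outs.append(h)
    return outs, h

prefix, suffix = [1.0, 2.0, 3.0], [4.0, 5.0]
_, cached = ssm_scan(prefix, h0=0.0)        # cache ONE state at the boundary
resumed, _ = ssm_scan(suffix, h0=cached)    # resume a prefix-sharing request
full, _ = ssm_scan(prefix + suffix, h0=0.0) # recompute from scratch
```

Resuming from the cached state reproduces the from-scratch computation exactly, which is why dense per-token KV reuse is the wrong abstraction for these layers.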
arXiv:2605.06046v1 Announce Type: new Abstract: Auto-regressive token generation in large language models is memory-bound because it requires "attending to" key and value tensors (KV cache) of all previous tokens. Prior work aims to improve the efficiency of this decode process b…
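The memory-bound pattern the abstract describes can be sketched as a minimal single-head decode step: every new token must read all cached keys and values, so traffic grows linearly with sequence length. This is a generic illustration, not the paper's proposed method.

```python
import numpy as np

# minimal single-head attention decode with a growing KV cache: each new
# token reads ALL cached keys and values, so decode becomes memory-bound
def decode_step(q, k_new, v_new, cache):
    cache["k"].append(k_new)
    cache["v"].append(v_new)
    K, V = np.stack(cache["k"]), np.stack(cache["v"])  # (t, d) each
    scores = K @ q / np.sqrt(q.size)
    p = np.exp(scores - scores.max())
    p /= p.sum()                  # softmax over all t cached tokens
    return p @ V

rng = np.random.default_rng(0)
d = 16
cache = {"k": [], "v": []}
for _ in range(5):
    out = decode_step(rng.normal(size=d), rng.normal(size=d),
                      rng.normal(size=d), cache)
```

Each call touches the whole cache, which is precisely the traffic that KV-cache compression and sparsification work tries to reduce.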
arXiv:2603.08022v2 Announce Type: replace Abstract: A data mixture refers to how different data sources are combined to train large language models, and selecting an effective mixture is crucial for optimal downstream performance. Existing methods either conduct costly searches d…
arXiv:2604.21231v2 Announce Type: replace-cross Abstract: Efficient inference for on-device Large Language Models (LLMs) remains challenging due to limited hardware resources and the high cost of the prefill stage, which processes the full input context to construct Key-Value (KV…
arXiv cs.LG
TIER_1·Chengyi Nie, Nian Si, Zijie Zhou·
arXiv:2605.04595v1 Announce Type: new Abstract: The rapid adoption of large language models (LLMs) has created significant challenges for efficient inference at scale. Unlike traditional workloads, LLM inference is constrained by both computation and the memory overhead of key-va…
arXiv:2605.00831v1 Announce Type: cross Abstract: The rise of million-token, agent-based applications has placed unprecedented demands on large language model (LLM) inference services. The long-running nature of these tasks increases their susceptibility to hardware and software …
arXiv:2410.09457v2 Announce Type: replace Abstract: Modern cryptographic methods for implementing privacy-preserving LLMs, such as homomorphic encryption (HE), require the LLMs to have a polynomial form. Forming such a representation is challenging because transformers include non-polynomial componen…
arXiv:2510.13668v2 Announce Type: replace-cross Abstract: Large Language Model (LLM) inference has emerged as a fundamental paradigm, however, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing s…
arXiv:2604.19351v3 Announce Type: replace Abstract: The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pr…
arXiv cs.LG
TIER_1·Yuzong Chen, Chao Fang, Xilai Dai, Yuheng Wu, Thierry Tambe, Marian Verhelst, Mohamed S. Abdelfattah·
arXiv:2511.06838v4 Announce Type: replace-cross Abstract: The substantial memory bandwidth and computational demands of large language models (LLMs) present critical challenges for efficient inference. To tackle this, the literature has explored heterogeneous systems that combine…
arXiv:2505.11329v5 Announce Type: replace-cross Abstract: Distributed inference of large language models (LLMs) using tensor parallelism can introduce communication overheads of 20% even over GPUs connected via NVLink, a high-speed GPU interconnect. Several techniques have been…
arXiv:2604.13847v2 Announce Type: replace Abstract: While sparse attention mitigates the computational bottleneck of long-context LLM training, its distributed training process exhibits extreme heterogeneity in both 1) sequence length and 2) sparsity sensitivity…
arXiv cs.AI
TIER_1·Zahra Yousefijamarani, Xinglu Wang, Qian Wang, Morgan Lindsay Heisler, Taha Shabani, Niloofar Gholipour, Parham Yassini, Hong Chang, Kan Chen, Qiantao Zhang, Xiaolong Bai, Jiannan Wang, Ying Xiong, Yong Zhang, Zhenan Fan·
arXiv:2508.15919v3 Announce Type: replace-cross Abstract: Large language model (LLM) serving faces the dual challenge of meeting strict user-specific service-level objectives (SLOs) while minimizing computational cost under dynamic, multi-task workloads. Existing approaches eithe…
Quantization is essential for efficient large language model (LLM) inference, yet the dequantization step (converting low-bit weights back to high precision for matrix multiplication) has become a critical bottleneck on modern AI accelerators. On architectures with decoupled comput…
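The packing scheme below is a generic int4 layout (two weights per byte, symmetric zero-point of 8, one shared scale), not the specific format of any accelerator in this item; it shows the extra unpack-and-scale work that dequantization adds on top of the matmul itself.

```python
import numpy as np

# toy int4 dequantization: each byte packs two 4-bit weights; decoding
# extracts both nibbles, re-centers them to [-8, 7], and applies a scale
def dequant_int4(packed, scale):
    lo = (packed & 0x0F).astype(np.int8) - 8   # low nibble
    hi = (packed >> 4).astype(np.int8) - 8     # high nibble
    w = np.empty(packed.size * 2, dtype=np.float32)
    w[0::2], w[1::2] = lo, hi                  # interleave back in order
    return w * scale

packed = np.array([0x2F, 0x80], dtype=np.uint8)
w = dequant_int4(packed, scale=0.1)  # -> [0.7, -0.6, -0.8, 0.0]
```

On accelerators with decoupled compute units, this bit-twiddling sits on the critical path before every quantized matrix multiplication, which is the bottleneck the passage describes.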
As large language models span dense, mixture-of-experts, and state-space architectures and are deployed on heterogeneous accelerators under increasingly diverse multimodal workloads, optimizing inference energy has become as critical as optimizing latency and throughput. Existing…
Hacker News — AI stories ≥50 points
TIER_1·mitchwainer·
<p>Inference efficiency has quietly become one of the most consequential bottlenecks in AI deployment. As agentic coding systems such as Claude Code, Codex, and Cursor scale from developer tools to infrastructure powering software development at large, the underlying inference en…
<blockquote> <p><em>This article was originally published on <a href="https://dingjiu1989-hue.github.io/en/ai/model-quantization.html" rel="noopener noreferrer">AI Study Room</a>. For the full version with working code examples and related articles, visit the original post.</em><…
<h2> Another inference engine? </h2> <p>So TokenSpeed is trending on GitHub this week, billing itself as a "speed-of-light LLM inference engine." I clicked through expecting either a vLLM clone or another Rust rewrite of llama.cpp. I haven't run it in production yet — the repo is…
<p>The number of LLM providers keeps growing and so does the confusion around pricing, availability and compatibility. OpenModels is an open-source project that brings structure to this landscape: a single registry where models, providers, and their relationships are documented, …
LightSeek Foundation has released TokenSpeed, an open-source LLM inference engine designed specifically for agentic AI workloads. The engine uses a C++ finite-state machine to enforce KV cache safety at compile time and outperformed TensorRT-LLM by around 9-11% on NVIDIA Blackwel…
📰 TokenSpeed 2026: Open-Source LLM Inference Engine Beats TensorRT-LLM in Agentic Workloads TokenSpeed, a new open-source LLM inference engine from the LightSeek Foundation, targets TensorRT-LLM-level performance for agentic coding systems. Designed to reduce latency and power co…
📰 TokenSpeed 2026: LightSeek Foundation Makes LLM Output Speed 60% More Efficient for Agentic Workloads ... LightSeek Foundation has released an open-source LLM inference engine named TokenSpeed to meet the demand of agentic systems. This technology delivers TensorRT-LLM-level…