New research explores LLM security, efficiency, and training optimization
By PulseAugur Editorial·
Summary by gemini-2.5-flash-lite
from 32 sources
Researchers are developing novel methods to enhance the efficiency and security of Large Language Models (LLMs). One study, "Widening the Gap," shows that outlier injection can compromise LLM quantization, demonstrating that security risks extend even to advanced quantization techniques like AWQ and GPTQ. Concurrently, other studies optimize LLM inference through adaptive quantization (XFP), speculative decoding with device-edge collaboration (GELATO), and efficient KV cache management (SparKV, Feather, Dooly). Additionally, new frameworks are emerging for analyzing LLM inference stability (Queueing-Theoretic Framework) and for optimizing training data selection (CAMEL).
AI
IMPACT
Advancements in LLM quantization security, inference efficiency, and training data optimization are crucial for broader and more secure AI deployment.
RANK_REASON
Multiple arXiv papers published on LLM-related topics including security, quantization, inference optimization, and training.
LLM quantization has become essential for memory-efficient deployment. Recent work has shown that quantization schemes can pose critical security risks: an adversary may release a model that appears benign in full precision but exhibits malicious behavior once quantized by users.…
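The truncated abstract above doesn't show the attack itself, but the underlying hazard is easy to demonstrate with a toy example: under symmetric round-to-nearest quantization, the scale is set by the largest-magnitude weight, so a single injected outlier can collapse every other weight to zero after quantization while leaving the full-precision model nearly unchanged. The `quantize_int4` function and the injection below are illustrative assumptions, not the paper's method.

```python
import numpy as np

# symmetric round-to-nearest int4 quantization: the scale is set by the
# largest-magnitude weight, which is exactly what an outlier can abuse
def quantize_int4(w):
    scale = np.abs(w).max() / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=64)   # benign small weights

clean = quantize_int4(w)

# inject one large outlier: the scale grows ~50x, so almost every other
# weight now rounds to zero -- the full-precision model barely changes,
# but the quantized model's behavior does
w_attacked = w.copy()
w_attacked[0] = 5.0
poisoned = quantize_int4(w_attacked)

survivors_clean = int(np.count_nonzero(clean))
survivors_poisoned = int(np.count_nonzero(poisoned))
```

After the injection, only the outlier itself survives quantization; every benign weight is rounded away.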
We introduce XFP, a dynamic weight quantizer for LLM inference that inverts the conventional workflow: the operator specifies reconstruction quality floors on per-channel cosine similarity (one strict floor for attention and shared experts, one lazy floor for routed-expert MoE); …
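XFP's actual algorithm isn't visible in this excerpt; as a rough sketch of the stated interface — pick the cheapest per-channel bit-width that clears a cosine-similarity floor, with a strict floor for some channels and a lazier one for others — one might write the following. All names and thresholds here are assumptions for illustration.

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def quantize(w, bits):
    # symmetric uniform quantization at the given bit-width
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / hi
    return np.clip(np.round(w / scale), lo, hi) * scale

def pick_bits(channel, floor, candidates=(2, 3, 4, 6, 8)):
    # smallest candidate bit-width whose reconstruction clears the floor
    for b in candidates:
        if cos_sim(channel, quantize(channel, b)) >= floor:
            return b
    return candidates[-1]

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 256))   # 8 channels of a toy weight matrix
strict, lazy = 0.999, 0.98      # e.g. attention vs routed-expert floors
bits_strict = [pick_bits(row, strict) for row in W]
bits_lazy = [pick_bits(row, lazy) for row in W]
```

Because any bit-width that satisfies the strict floor also satisfies the lazy one, the lazy floor can only assign equal or fewer bits per channel.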
The recent growth of on-device Large Language Model (LLM) inference has driven significant interest in device-edge collaborative LLM inference. As a promising architecture, Speculative Decoding (SD) is increasingly adopted where a lightweight draft model rapidly generates candida…
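The abstract is cut off, but the speculative-decoding loop it refers to is standard: a cheap draft model proposes several tokens, and the target model verifies them, accepting the longest agreeing prefix. A greedy-verification toy with stand-in next-token functions (not real models) looks like:

```python
# greedy speculative decoding sketch; the two toy next-token functions
# stand in for the on-device draft model and the edge target model
def draft_next(ctx):
    # cheap approximation: always predicts last_token + 1 (mod 10)
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    # "ground truth": agrees with the draft except after token 3
    return 7 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

def speculative_step(ctx, k=4):
    # 1) draft k candidate tokens autoregressively with the cheap model
    c, draft = list(ctx), []
    for _ in range(k):
        t = draft_next(c)
        draft.append(t)
        c.append(t)
    # 2) verify: accept the longest agreeing prefix, then emit the
    #    target's own token (a real system batches this verification
    #    into a single target-model forward pass)
    out = list(ctx)
    for t in draft:
        if target_next(out) == t:
            out.append(t)
        else:
            out.append(target_next(out))
            break
    else:
        out.append(target_next(out))  # all accepted: free bonus token
    return out

seq = [0]
for _ in range(3):
    seq = speculative_step(seq)
# the output is identical to decoding greedily with the target alone,
# but the target only ran once per batch of drafted tokens
```

Greedy verification preserves the target model's output exactly; the speedup comes from verifying several drafted tokens per target pass.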
Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, since no single choice performs best across all workloads. Profile-based simulators are the standard tool, yet they hardcode their …
arXiv:2605.05873v1 Announce Type: cross Abstract: Large language models often improve reasoning by sampling multiple outputs and aggregating their final answers, but precise and efficient control of error levels remains a challenging task. In particular, deciding when to stop sam…
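The paper's stopping rule isn't visible in this truncated abstract; as a generic illustration of the problem it poses, here is a naive margin-based stopping rule for answer aggregation. The margin threshold, sample cap, and stand-in sampler are all made up for the sketch.

```python
import random

# margin-based adaptive stopping for answer aggregation: keep sampling
# until the leading answer is `margin` votes ahead, or a hard cap is hit
def sample_until_confident(sample_answer, margin=3, cap=25):
    counts = {}
    for n in range(1, cap + 1):
        a = sample_answer()
        counts[a] = counts.get(a, 0) + 1
        ranked = sorted(counts.values(), reverse=True)
        lead = ranked[0] - (ranked[1] if len(ranked) > 1 else 0)
        if lead >= margin:
            break
    return max(counts, key=counts.get), n

random.seed(0)
# stand-in sampler: the "model" answers "42" with probability 0.7
answer, n_samples = sample_until_confident(
    lambda: "42" if random.random() < 0.7 else "41")
```

The trade-off the abstract alludes to is visible even here: a larger margin lowers the error rate but raises the expected number of samples.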
arXiv:2605.05219v1 Announce Type: new Abstract: Prefix caching is a key latency optimization for autoregressive LLM serving, yet existing systems assume dense per-token key/value reuse. State-space models change the structure of the problem: a recurrent layer can resume from a si…
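The structural difference the abstract points to shows up even in a scalar toy: a recurrent (state-space) layer summarizes the entire prefix in one state, so a prefix cache needs only a single snapshot at the boundary rather than per-token keys and values. The linear recurrence below is illustrative, not any specific SSM.

```python
# toy linear recurrence standing in for a state-space layer:
# h_t = a*h_{t-1} + b*x_t, so the whole prefix is summarized by h
def ssm_scan(xs, h0, a=0.9, b=0.5):
    h, outs = h0, []
    for x in xs:
        h = a * h + b * x
        outs.append(h)
    return outs, h

prefix, suffix = [1.0, 2.0, 3.0], [4.0, 5.0]
_, cached = ssm_scan(prefix, h0=0.0)        # cache ONE state at the boundary
resumed, _ = ssm_scan(suffix, h0=cached)    # resume a prefix-sharing request
full, _ = ssm_scan(prefix + suffix, h0=0.0) # recompute from scratch
```

Resuming from the cached state reproduces the from-scratch computation exactly, which is why dense per-token KV reuse is the wrong abstraction for these layers.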
arXiv:2605.06046v1 Announce Type: new Abstract: Auto-regressive token generation in large language models is memory-bound because it requires "attending to" key and value tensors (KV cache) of all previous tokens. Prior work aims to improve the efficiency of this decode process b…
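The memory-bound pattern the abstract describes can be sketched as a minimal single-head decode step: every new token must read all cached keys and values, so traffic grows linearly with sequence length. This is a generic illustration, not the paper's proposed method.

```python
import numpy as np

# minimal single-head attention decode with a growing KV cache: each new
# token reads ALL cached keys and values, so decode becomes memory-bound
def decode_step(q, k_new, v_new, cache):
    cache["k"].append(k_new)
    cache["v"].append(v_new)
    K, V = np.stack(cache["k"]), np.stack(cache["v"])  # (t, d) each
    scores = K @ q / np.sqrt(q.size)
    p = np.exp(scores - scores.max())
    p /= p.sum()                  # softmax over all t cached tokens
    return p @ V

rng = np.random.default_rng(0)
d = 16
cache = {"k": [], "v": []}
for _ in range(5):
    out = decode_step(rng.normal(size=d), rng.normal(size=d),
                      rng.normal(size=d), cache)
```

Each call touches the whole cache, which is precisely the traffic that KV-cache compression and sparsification work tries to reduce.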
arXiv:2603.08022v2 Announce Type: replace Abstract: A data mixture refers to how different data sources are combined to train large language models, and selecting an effective mixture is crucial for optimal downstream performance. Existing methods either conduct costly searches d…
arXiv:2604.21231v2 Announce Type: replace-cross Abstract: Efficient inference for on-device Large Language Models (LLMs) remains challenging due to limited hardware resources and the high cost of the prefill stage, which processes the full input context to construct Key-Value (KV…
arXiv cs.LG
TIER_1·Chengyi Nie, Nian Si, Zijie Zhou·
arXiv:2605.04595v1 Announce Type: new Abstract: The rapid adoption of large language models (LLMs) has created significant challenges for efficient inference at scale. Unlike traditional workloads, LLM inference is constrained by both computation and the memory overhead of key-va…
arXiv:2605.00831v1 Announce Type: cross Abstract: The rise of million-token, agent-based applications has placed unprecedented demands on large language model (LLM) inference services. The long-running nature of these tasks increases their susceptibility to hardware and software …
arXiv:2410.09457v2 Announce Type: replace Abstract: Modern cryptographic methods for implementing privacy-preserving LLMs, such as homomorphic encryption (HE), require the LLMs to have a polynomial form. Forming such a representation is challenging because transformers include non-polynomial componen…
arXiv:2510.13668v2 Announce Type: replace-cross Abstract: Large Language Model (LLM) inference has emerged as a fundamental paradigm, however, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing s…
arXiv:2604.19351v3 Announce Type: replace Abstract: The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pr…
arXiv cs.LG
TIER_1·Yuzong Chen, Chao Fang, Xilai Dai, Yuheng Wu, Thierry Tambe, Marian Verhelst, Mohamed S. Abdelfattah·
arXiv:2511.06838v4 Announce Type: replace-cross Abstract: The substantial memory bandwidth and computational demands of large language models (LLMs) present critical challenges for efficient inference. To tackle this, the literature has explored heterogeneous systems that combine…
arXiv:2505.11329v5 Announce Type: replace-cross Abstract: Distributed inference of large language models (LLMs) using tensor parallelism can introduce communication overheads of 20% even over GPUs connected via NVLink, a high-speed GPU interconnect. Several techniques have been…
arXiv:2604.13847v2 Announce Type: replace Abstract: While sparse attention mitigates the computational bottleneck of long-context LLM training, its distributed training process exhibits extreme heterogeneity in both 1) sequence length and 2) sparsity sensitivity…
arXiv cs.AI
TIER_1·Zahra Yousefijamarani, Xinglu Wang, Qian Wang, Morgan Lindsay Heisler, Taha Shabani, Niloofar Gholipour, Parham Yassini, Hong Chang, Kan Chen, Qiantao Zhang, Xiaolong Bai, Jiannan Wang, Ying Xiong, Yong Zhang, Zhenan Fan·
arXiv:2508.15919v3 Announce Type: replace-cross Abstract: Large language model (LLM) serving faces the dual challenge of meeting strict user-specific service-level objectives (SLOs) while minimizing computational cost under dynamic, multi-task workloads. Existing approaches eithe…
Quantization is essential for efficient large language model (LLM) inference, yet the dequantization step (converting low-bit weights back to high precision for matrix multiplication) has become a critical bottleneck on modern AI accelerators. On architectures with decoupled comput…
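The packing scheme below is a generic int4 layout (two weights per byte, symmetric zero-point of 8, one shared scale), not the specific format of any accelerator in this item; it shows the extra unpack-and-scale work that dequantization adds on top of the matmul itself.

```python
import numpy as np

# toy int4 dequantization: each byte packs two 4-bit weights; decoding
# extracts both nibbles, re-centers them to [-8, 7], and applies a scale
def dequant_int4(packed, scale):
    lo = (packed & 0x0F).astype(np.int8) - 8   # low nibble
    hi = (packed >> 4).astype(np.int8) - 8     # high nibble
    w = np.empty(packed.size * 2, dtype=np.float32)
    w[0::2], w[1::2] = lo, hi                  # interleave back in order
    return w * scale

packed = np.array([0x2F, 0x80], dtype=np.uint8)
w = dequant_int4(packed, scale=0.1)  # -> [0.7, -0.6, -0.8, 0.0]
```

On accelerators with decoupled compute units, this bit-twiddling sits on the critical path before every quantized matrix multiplication, which is the bottleneck the passage describes.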
As large language models span dense, mixture-of-experts, and state-space architectures and are deployed on heterogeneous accelerators under increasingly diverse multimodal workloads, optimizing inference energy has become as critical as optimizing latency and throughput. Existing…
Hacker News — AI stories ≥50 points
TIER_1·mitchwainer·
<p>Inference efficiency has quietly become one of the most consequential bottlenecks in AI deployment. As agentic coding systems such as Claude Code, Codex, and Cursor scale from developer tools to infrastructure powering software development at large, the underlying inference en…
<blockquote> <p><em>This article was originally published on <a href="https://dingjiu1989-hue.github.io/en/ai/model-quantization.html" rel="noopener noreferrer">AI Study Room</a>. For the full version with working code examples and related articles, visit the original post.</em><…
<h2> Another inference engine? </h2> <p>So TokenSpeed is trending on GitHub this week, billing itself as a "speed-of-light LLM inference engine." I clicked through expecting either a vLLM clone or another Rust rewrite of llama.cpp. I haven't run it in production yet — the repo is…
<p>The number of LLM providers keeps growing and so does the confusion around pricing, availability and compatibility. OpenModels is an open-source project that brings structure to this landscape: a single registry where models, providers, and their relationships are documented, …
LightSeek Foundation has released TokenSpeed, an open-source LLM inference engine designed specifically for agentic AI workloads. The engine uses a C++ finite-state machine to enforce KV cache safety at compile time and outperformed TensorRT-LLM by around 9-11% on NVIDIA Blackwel…
📰 TokenSpeed 2026: Open-Source LLM Inference Engine Beats TensorRT-LLM in Agentic Workloads TokenSpeed, a new open-source LLM inference engine from the LightSeek Foundation, targets TensorRT-LLM-level performance for agentic coding systems. Designed to reduce latency and power co…
📰 TokenSpeed 2026: LightSeek Foundation Makes LLM Output Speed 60% More Efficient for Agentic Workloads ... LightSeek Foundation has released an open-source LLM inference engine named TokenSpeed to meet the demand of agentic systems. This technology delivers TensorRT-LLM-level…