PulseAugur

New research identifies "service-induced congestion" impacting LLM serving throughput

A new research paper posted to arXiv describes "service-induced congestion" in large language model (LLM) serving. Each request's key-value (KV) cache grows with every token it generates, so a request occupies more GPU memory the longer it is in service, and under high concurrency the aggregate memory pressure builds over time. When the memory limit is exceeded, active requests are evicted, which wastes the computation already spent on them and reduces overall throughput by up to 50%. The paper proposes a dynamical model to analyze the effect and suggests scheduling design principles to mitigate it, in particular leveraging workload heterogeneity to stabilize memory-constrained serving.
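The eviction mechanism can be illustrated with a toy simulation. The sketch below is not the paper's dynamical model. It is a minimal Python illustration under assumed rules: memory is counted in KV-cache tokens, every active request generates one token per step, admission checks only the current footprint, and on overflow the most recently admitted request is evicted, loses its generated tokens, and is requeued. The names `Request`, `simulate`, `capacity`, and `max_batch` are hypothetical and chosen only for this example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int      # tokens held in the KV cache at admission
    output_len: int      # tokens to generate before the request completes
    generated: int = 0   # tokens generated so far in the current attempt

    @property
    def kv_tokens(self) -> int:
        # KV-cache footprint grows by one token per generated token.
        return self.prompt_len + self.generated


def simulate(num_requests, prompt_len, output_len, capacity, max_batch,
             max_steps=100_000):
    """Toy memory-constrained serving loop (illustrative assumptions only)."""
    queue = deque(Request(prompt_len, output_len) for _ in range(num_requests))
    active = []
    useful = wasted = steps = 0

    while (queue or active) and steps < max_steps:
        # Greedy admission: looks at the current footprint, not future growth.
        while (queue and len(active) < max_batch
               and sum(r.kv_tokens for r in active) + queue[0].kv_tokens <= capacity):
            active.append(queue.popleft())

        # One decode step: every active request generates one token.
        for r in active:
            r.generated += 1

        # On overflow, evict the newest request; its generated tokens are lost.
        while sum(r.kv_tokens for r in active) > capacity:
            victim = active.pop()
            wasted += victim.generated
            victim.generated = 0
            queue.appendleft(victim)

        # Retire finished requests and count their tokens as useful work.
        useful += sum(r.output_len for r in active if r.generated >= r.output_len)
        active = [r for r in active if r.generated < r.output_len]
        steps += 1

    return {"steps": steps, "useful": useful, "wasted": wasted}


if __name__ == "__main__":
    # Same workload and capacity, two concurrency limits.
    print(simulate(8, 10, 10, capacity=80, max_batch=4))
    print(simulate(8, 10, 10, capacity=80, max_batch=8))
```

With `max_batch=4`, four requests at their peak footprint of 20 tokens each fit exactly within the 80-token capacity, so nothing is evicted. With `max_batch=8`, all eight prompts fit at admission, but the first decode step already pushes usage past capacity and triggers evictions, so some generated tokens are discarded and must be recomputed.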

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Ruicheng Ao, Jing Dong, Gan Luo, David Simchi-Levi

    Service-Induced Congestion in Memory-Constrained LLM Serving

    arXiv:2606.15555v1 · Announce Type: cross · Abstract: In large language model (LLM) serving, each request accumulates persistent graphics processing unit (GPU) memory during service as its key-value cache grows with every generated token. Under high concurrency, aggregate memory usag…

  2. arXiv stat.ML TIER_1 English(EN) · David Simchi-Levi

    Service-Induced Congestion in Memory-Constrained LLM Serving

    In large language model (LLM) serving, each request accumulates persistent graphics processing unit (GPU) memory during service as its key-value cache grows with every generated token. Under high concurrency, aggregate memory usage therefore increases endogenously over time: the …