A new research paper on arXiv describes a phenomenon called "service-induced congestion" in large language model (LLM) serving. Each request's key-value (KV) cache grows with every token it generates, so serving itself increases memory pressure, and that pressure compounds under high concurrency. When the memory limit is exceeded, active requests are evicted, which wastes the computation already spent on them and can reduce overall throughput by up to 50%. The paper proposes a dynamical model to analyze the problem and suggests scheduling design principles to mitigate it, in particular leveraging workload heterogeneity to stabilize memory-constrained serving.
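The mechanism can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's dynamical model: the `Request` class, the `simulate` function, the evict-newest policy, and all token counts are hypothetical, and memory is measured simply as the number of cached tokens.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    output_tokens: int
    generated: int = 0

    @property
    def kv_tokens(self) -> int:
        # KV cache footprint (in tokens) grows by one per generated token.
        return self.prompt_tokens + self.generated


def simulate(num_requests, concurrency, capacity,
             prompt_tokens=100, output_tokens=100):
    """Toy decode loop (hypothetical). Returns (useful, wasted, steps)."""
    waiting = [Request(prompt_tokens, output_tokens)
               for _ in range(num_requests)]
    active = []
    useful = wasted = steps = 0
    while waiting or active:
        # Admit requests up to the concurrency limit.
        while waiting and len(active) < concurrency:
            active.append(waiting.pop())
        # One decode step: every active request emits one token.
        for r in active:
            r.generated += 1
        steps += 1
        # Retire finished requests; their tokens count as useful work.
        useful += sum(r.generated for r in active
                      if r.generated >= r.output_tokens)
        active = [r for r in active if r.generated < r.output_tokens]
        # Over capacity: evict the newest request and discard its progress.
        # (Assumption: an evicted request restarts decoding from scratch.)
        while len(active) > 1 and sum(r.kv_tokens for r in active) > capacity:
            victim = active.pop()
            wasted += victim.generated
            victim.generated = 0
            waiting.append(victim)
    return useful, wasted, steps
```

In this sketch, `simulate(8, 4, 800)` never exceeds the capacity and wastes nothing, while `simulate(8, 8, 1000)` admits more requests than the cache can hold as they grow, so some are evicted repeatedly and their generated tokens are thrown away. Both runs deliver the same useful output, but the second spends extra decode work to get there.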
- arXiv
- graphics processing unit
- Hugging Face
- large language model
- alphaXiv
- CatalyzeX
- DagsHub
- Gotit.pub
- Influence Flower
- ScienceCast
AI-generated summary · Google Gemini · from 2 sources.