The Request Is the Wrong Unit of Scale for LLMs on Kubernetes
The scaling model used for traditional web applications, which relies on request counts, is insufficient for serving large language models (LLMs). The cost of an LLM request depends on how many input and output tokens it carries, so two workloads with identical HTTP request rates can place very different loads on the system. Input tokens drive the time to first token, while output tokens drive total generation time and consume serving capacity. As a result, performance can degrade even when request metrics appear stable.
IMPACT: Highlights the need for new scaling metrics beyond request counts for efficient LLM deployment.
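To make the gap between request counts and token counts concrete, here is a minimal Python sketch. It applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, `ceil(currentReplicas * currentMetric / targetMetric)`, to two traffic windows that have the same number of requests but very different token volumes. The `Request` type, the sample windows, and both target values are hypothetical and chosen only for illustration. Summing input and output tokens with equal weight is also an assumption; a real deployment would likely weight them differently, since they stress different phases of inference.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One LLM call, described only by its token counts (hypothetical type)."""
    input_tokens: int
    output_tokens: int


def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """Replica formula from the Kubernetes HPA docs, floored at one replica."""
    return max(1, math.ceil(current_replicas * current_value / target_value))


def requests_per_replica(window: list[Request], replicas: int) -> float:
    """Classic web metric: every request counts the same."""
    return len(window) / replicas


def tokens_per_replica(window: list[Request], replicas: int) -> float:
    """Token metric: a request counts in proportion to its tokens.

    Assumption: input and output tokens are weighted equally.
    """
    total = sum(r.input_tokens + r.output_tokens for r in window)
    return total / replicas


# Two illustrative windows with the same request count (made-up numbers).
short_chat = [Request(input_tokens=50, output_tokens=100)] * 40
long_docs = [Request(input_tokens=4000, output_tokens=1000)] * 40

REPLICAS = 4
TARGET_REQUESTS_PER_REPLICA = 10    # assumed target
TARGET_TOKENS_PER_REPLICA = 2000    # assumed target

for name, window in (("short_chat", short_chat), ("long_docs", long_docs)):
    by_requests = desired_replicas(
        REPLICAS, requests_per_replica(window, REPLICAS), TARGET_REQUESTS_PER_REPLICA
    )
    by_tokens = desired_replicas(
        REPLICAS, tokens_per_replica(window, REPLICAS), TARGET_TOKENS_PER_REPLICA
    )
    print(f"{name}: by requests -> {by_requests}, by tokens -> {by_tokens}")
```

Under these assumed targets, the request-based metric asks for the same 4 replicas in both windows, while the token-based metric asks for 3 replicas for the short chats and 100 for the long documents. The absolute numbers mean nothing outside this toy setup; the point is that the request metric cannot distinguish the two windows at all.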