PulseAugur

vLLM continuous batching causes p99 latency spikes for Llama 3.3

A developer at Nexus Labs encountered significant latency issues after enabling continuous batching in vLLM for their Llama 3.3 70B model. While throughput initially improved, p99 latency increased eightfold, impacting their service level objectives. The problem was traced to long prefill requests blocking decode operations within the same forward pass.
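The stall can be illustrated with a toy cost model. The sketch below is not vLLM's scheduler and does not reproduce the post's measurements: the function `step_times`, the cost constants, and the request sizes are all hypothetical. It assumes each forward pass costs a fixed overhead plus a per-token cost, so a long prompt prefilled in a single pass delays every decode sharing that pass. Capping the tokens per pass, which is roughly what chunked prefill with a `max_num_batched_tokens` budget does according to the source post's TL;DR, splits that delay into shorter steps.

```python
def step_times(prefill_tokens, num_decodes, token_budget=None,
               base_ms=5.0, ms_per_token=0.05):
    """Per-step durations (ms) while one prompt is prefilled next to running decodes.

    Toy model only: base_ms and ms_per_token are made-up constants, and each
    running decode contributes exactly one token to every step.
    """
    times = []
    remaining = prefill_tokens
    while remaining > 0:
        if token_budget is None:
            # No chunking: the whole prompt lands in a single forward pass.
            chunk = remaining
        else:
            # Chunking: decodes take their slots first, the prompt gets the rest.
            chunk = min(remaining, max(1, token_budget - num_decodes))
        remaining -= chunk
        times.append(base_ms + ms_per_token * (chunk + num_decodes))
    return times


# Hypothetical workload: one 8000-token prompt arrives while 32 requests decode.
unchunked = step_times(8000, 32)
chunked = step_times(8000, 32, token_budget=2048)

# The longest step is the worst gap a decoding request sees between two tokens.
print(f"unchunked: {len(unchunked)} step(s), worst decode stall {max(unchunked):.1f} ms")
print(f"chunked:   {len(chunked)} step(s), worst decode stall {max(chunked):.1f} ms")
print(f"prefill completion: {sum(unchunked):.1f} ms vs {sum(chunked):.1f} ms")
```

In this model the worst decode stall shrinks when the prompt is chunked, while the prompt itself takes slightly longer to finish because it pays the fixed per-step overhead several times. That is the same throughput-versus-tail-latency trade-off the summary describes, shown here only qualitatively.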

IMPACT Highlights a common trade-off in LLM serving infrastructure, where throughput gains from features like continuous batching can negatively impact latency-sensitive applications.

RANK_REASON This is a technical post about optimizing an existing tool (vLLM) for a specific model (Llama 3.3) and workload, rather than a new release or major industry event.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. dev.to (LLM tag) · TIER_1 · English (EN) · Marcus Chen

    Continuous batching wrecked our p99 latency. Here's the trace.

    TL;DR: We turned on vLLM continuous batching for a throughput win and watched p99 latency 8x in the wrong direction. Long prefills were stalling decodes in the same forward pass. Chunked prefill and a tuned max_num_batched_tokens got the SLO back at the co…