LLM latency split: 48% prefill, 52% decode

By PulseAugur Editorial · [1 source] · 2026-05-26 23:00

A new analysis from SemiAnalysis attributes approximately 48% of end-to-end latency in large language models to the prefill stage and the remaining 52% to decode. Prefill itself breaks into two operations: prefill extend (cache write), which ingests new context or files and writes fresh KV tokens, and cache read, which reuses the existing KV cache from prior turns.
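To make the two prefill operations concrete, here is a minimal Python sketch of how a serving stack might divide a prompt between them. It assumes the cache is keyed on a shared token prefix, which the source does not state; `PrefillPlan`, `plan_prefill`, and the token IDs are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PrefillPlan:
    cache_read_tokens: int      # tokens whose KV entries are reused from a prior turn
    prefill_extend_tokens: int  # tokens whose KV entries must be computed and written


def plan_prefill(cached_tokens: list[int], prompt_tokens: list[int]) -> PrefillPlan:
    """Split a prompt into a reused prefix and a freshly computed suffix.

    Assumption: only the longest common prefix with the cached sequence is reusable.
    """
    shared = 0
    for cached, new in zip(cached_tokens, prompt_tokens):
        if cached != new:
            break
        shared += 1
    return PrefillPlan(shared, len(prompt_tokens) - shared)


# Hypothetical token IDs: turn 2 repeats turn 1 and appends three new tokens.
turn_1 = [101, 7, 42, 9]
turn_2 = [101, 7, 42, 9, 15, 88, 3]
plan = plan_prefill(turn_1, turn_2)
print(plan)  # PrefillPlan(cache_read_tokens=4, prefill_extend_tokens=3)
```

In this toy case the four tokens carried over from the first turn fall under cache read, and the three appended tokens fall under prefill extend.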

IMPACT Understanding latency breakdown in LLMs is crucial for optimizing inference speed and cost.
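As a rough illustration of why the split matters for optimization, the sketch below applies an Amdahl-style estimate to the reported shares. It assumes prefill and decode run back to back without overlap and that the approximate 48/52 split holds for the workload in question; neither assumption comes from the source, and `end_to_end_speedup` is a hypothetical helper.

```python
PREFILL_SHARE = 0.48  # approximate share reported in the source
DECODE_SHARE = 0.52   # approximate share reported in the source


def end_to_end_speedup(prefill_speedup: float = 1.0, decode_speedup: float = 1.0) -> float:
    """Amdahl-style estimate of overall speedup when each stage is accelerated independently.

    Assumption: the two stages are sequential and do not overlap.
    """
    new_latency = PREFILL_SHARE / prefill_speedup + DECODE_SHARE / decode_speedup
    return 1.0 / new_latency


prefill_only = end_to_end_speedup(prefill_speedup=2.0)
decode_only = end_to_end_speedup(decode_speedup=2.0)
print(round(prefill_only, 3))  # 1.316
print(round(decode_only, 3))   # 1.351
```

Under these assumptions, doubling the speed of either stage alone yields only about a 1.3x end-to-end gain, because the other stage still accounts for roughly half of the latency.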

RANK_REASON Analysis of LLM performance characteristics from a third-party source.

SemiAnalysis

infra

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

X — SemiAnalysis TIER_1 English(EN) · SemiAnalysis_ · 2026-05-26 23:00

PDOOM ALERT 🚨 : ~48% of e2e LLM latency is prefill, ~52% is decode. Prefill itself breaks into 2 ops:
🟠 Prefill extend (cache write): ingests new context/files, writes fresh KV tokens
🟠 Cache read: reuses existing KV cache from prior turns
https://t.co/zzKrZFZKhX
