PulseAugur

LLM latency split: 48% prefill, 52% decode

A SemiAnalysis breakdown estimates that roughly 48% of end-to-end latency in large language models comes from the prefill stage, with the remaining 52% or so coming from decode. Prefill itself splits into two operations: prefill extend (a cache write), which ingests new context or files and writes fresh KV cache entries, and cache read, which reuses the existing KV cache from prior turns.
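
To make the split concrete, the sketch below computes prefill and decode shares from two timings. It assumes, purely for illustration, that time to first token approximates prefill and that the rest of the request is decode; the source does not say how its figures were measured. The function name `latency_split` and the sample timings are hypothetical.

```python
def latency_split(ttft_s: float, total_s: float) -> tuple[float, float]:
    """Return (prefill_share, decode_share) as fractions of total latency.

    Assumption: time to first token (ttft_s) stands in for prefill time,
    and everything after it counts as decode.
    """
    if total_s <= 0 or not 0 <= ttft_s <= total_s:
        raise ValueError("need 0 <= ttft_s <= total_s and total_s > 0")
    prefill_share = ttft_s / total_s
    return prefill_share, 1.0 - prefill_share


# Hypothetical timings chosen to match the reported ~48% / ~52% split.
prefill, decode = latency_split(ttft_s=1.2, total_s=2.5)
print(f"prefill {prefill:.0%}, decode {decode:.0%}")
```

With shares this close to even, neither stage dominates, so speeding up only one of them bounds the achievable end-to-end gain at roughly half.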

IMPACT Understanding latency breakdown in LLMs is crucial for optimizing inference speed and cost.

RANK_REASON Analysis of LLM performance characteristics from a third-party source.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. X — SemiAnalysis TIER_1 English(EN) · SemiAnalysis_ ·

    PDOOM ALERT 🚨: ~48% of e2e LLM latency is prefill, ~52% is decode. Prefill itself breaks into 2 ops:
    🟠 Prefill extend (cache write): ingests new context/files, writes fresh KV tokens
    🟠 Cache read: reuses existing KV cache from prior turns
    https://t.co/zzKrZFZKhX
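
The two prefill operations named in the post can be illustrated with a minimal sketch that splits an incoming prompt into the part already covered by a cached conversation prefix (cache read) and the part that still has to be processed and written to the cache (prefill extend). This is an assumption-laden toy: it matches token by token, whereas real serving stacks often reuse cache at a coarser block granularity, and `split_prefill` is a hypothetical helper, not any library's API.

```python
from typing import List, Tuple


def split_prefill(
    cached_tokens: List[int], prompt_tokens: List[int]
) -> Tuple[List[int], List[int]]:
    """Return (cache_read_tokens, prefill_extend_tokens) for a new prompt.

    cache_read_tokens: longest shared prefix with the cached sequence,
        whose KV entries could be reused.
    prefill_extend_tokens: the remaining tokens, which would need fresh
        KV entries computed and written.
    """
    shared = 0
    for cached, new in zip(cached_tokens, prompt_tokens):
        if cached != new:
            break
        shared += 1
    return prompt_tokens[:shared], prompt_tokens[shared:]


# Hypothetical token IDs: the prior turn cached four tokens, and the new
# turn repeats them and appends two more.
read, extend = split_prefill([1, 2, 3, 4], [1, 2, 3, 4, 5, 6])
print(len(read), "tokens from cache read;", len(extend), "tokens for prefill extend")
```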