PulseAugur

llama.cpp RDNA3: Flash Attention cuts KV VRAM with packed 8-bit K

A new Flash Attention method for llama.cpp on RDNA3 GPUs reduces KV cache VRAM usage by packing four 8-bit K values into each 32-bit word, which the GPU's native `sudot4` instruction then processes directly. The source post reports 47% less KV VRAM than Vulkan with f16 K and a saving of approximately 1.42 GiB at 128k context, which may allow larger contexts to fit within available memory. Quality metrics, including Kullback-Leibler divergence and perplexity, show minimal degradation compared to standard FP16 K values, indicating near-lossless quality.
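To make the packing idea concrete, here is a minimal CPU-side sketch in Python. It is not the author's kernel: the function names (`quantize_q8`, `pack4_i8`, `dot4_i8`, `qk_dot`) are hypothetical, the symmetric per-vector scale is an assumption, and quantizing the query side to 8 bits is also an assumption made only so that both operands are integers. `dot4_i8` merely emulates in software what a 4-way 8-bit dot-product instruction does in a single step.

```python
import struct

def quantize_q8(values):
    """Symmetric 8-bit quantization (assumed scheme): returns (scale, ints in [-127, 127])."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, q

def pack4_i8(q):
    """Pack four signed 8-bit ints into one unsigned 32-bit word (little-endian)."""
    assert len(q) == 4
    return struct.unpack("<I", struct.pack("<4b", *q))[0]

def dot4_i8(a_word, b_word, acc=0):
    """Software emulation of a 4-way signed int8 dot product with integer accumulate."""
    a = struct.unpack("<4b", struct.pack("<I", a_word))
    b = struct.unpack("<4b", struct.pack("<I", b_word))
    return acc + sum(x * y for x, y in zip(a, b))

def qk_dot(q_vec, k_vec):
    """Approximate dot(q_vec, k_vec) using packed 8-bit words and one rescale at the end."""
    assert len(q_vec) == len(k_vec) and len(q_vec) % 4 == 0
    sq, qq = quantize_q8(q_vec)
    sk, kq = quantize_q8(k_vec)
    acc = 0
    for i in range(0, len(qq), 4):
        acc = dot4_i8(pack4_i8(qq[i:i + 4]), pack4_i8(kq[i:i + 4]), acc)
    # Integer accumulator is converted back to a float with the two scales.
    return acc * sq * sk
```

Storing K as one byte per element instead of two is where the memory saving comes from; the integer dot product means the packed words can be consumed without first expanding them back to FP16.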

IMPACT Optimizes local LLM inference, potentially enabling larger context windows on consumer hardware.

RANK_REASON This is a technical optimization for a specific software and hardware combination, not a new model release or major industry event.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA · TIER_1 · English (EN) · /u/DrBearJ3w

    Flash Attention for llama.cpp on RDNA3: 47% less KV VRAM than Vulkan f16 K, KLD almost lossless on F16 K / q4_0 V. Part 1.

    The normal tradeoff in llama.cpp attention is: quantize your KV cache and lose quality, or keep fp16 and burn VRAM. On RDNA3 there's now a third option: pack four 8-bit K values into a single 32-bit word and feed them directly to the GPU's nat…