llama.cpp on RDNA3: Flash Attention cuts K-cache VRAM usage with packed 8-bit K

By the PulseAugur editorial team · [1 source] · 2026-05-31 10:51

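A new approach for llama.cpp on RDNA3 GPUs significantly reduces KV-cache VRAM usage by packing K values into 8-bit integers that are then processed by the GPU's native `sudot4` instruction. It saves about 1.42 GiB of VRAM at a 128k context, which may allow larger contexts to fit in available memory. Quality metrics, including Kullback-Leibler divergence and perplexity, degrade only slightly compared with standard FP16 K values, indicating that the result is nearly lossless.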
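The packing idea can be illustrated in plain Python. This is a minimal sketch, not the author's kernel: the helper names (`quantize_i8`, `pack4_i8`, `unpack4_i8`, `dot4_i8`) are hypothetical, the symmetric per-block scale is an assumed quantization scheme, and `dot4_i8` only emulates in software the kind of 4-way 8-bit dot product with 32-bit accumulation that the source attributes to `sudot4`.

```python
import struct

def quantize_i8(xs):
    """Symmetric quantization of a block of floats to int8 (assumed scheme).

    Returns the int8 values and the float scale needed to dequantize.
    """
    amax = max(abs(x) for x in xs)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def pack4_i8(vals):
    """Pack four signed 8-bit integers into one 32-bit word (little-endian)."""
    assert len(vals) == 4 and all(-128 <= v <= 127 for v in vals)
    return struct.unpack("<I", struct.pack("<4b", *vals))[0]

def unpack4_i8(word):
    """Inverse of pack4_i8: split a 32-bit word into four signed bytes."""
    return list(struct.unpack("<4b", struct.pack("<I", word)))

def dot4_i8(a_word, b_word, acc=0):
    """Software emulation of a 4-way signed int8 dot product with accumulate.

    Python integers do not overflow, so 32-bit wraparound is not modeled here.
    """
    a = unpack4_i8(a_word)
    b = unpack4_i8(b_word)
    return acc + sum(x * y for x, y in zip(a, b))

# Illustrative values only: one 4-element slice of a query row and a key row.
q_vals = [1.0, 0.5, -0.5, 0.25]
k_vals = [0.5, -1.0, 0.25, 0.75]

q_i8, q_scale = quantize_i8(q_vals)
k_i8, k_scale = quantize_i8(k_vals)

# Integer dot product on packed words, rescaled back to float afterwards.
approx = dot4_i8(pack4_i8(q_i8), pack4_i8(k_i8)) * q_scale * k_scale
exact = sum(a * b for a, b in zip(q_vals, k_vals))
```

Here `approx` stays close to `exact` because each value is rounded by at most half a quantization step. Storing K as one byte per value instead of two is where the memory saving comes from; the per-block scale is the only extra data in this sketch.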

Impact: Optimizes local LLM inference and may enable larger context windows on consumer hardware.
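To see why halving the K element size matters at long context, the K-cache size can be estimated with simple arithmetic. The sketch below is illustrative only: the function name `k_cache_bytes` and the model dimensions (32 layers, 8 KV heads, head dimension 128) are assumptions, not the configuration behind the 1.42 GiB figure, and quantization scales and padding are ignored.

```python
GIB = 1024 ** 3

def k_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    """Raw K-cache size in bytes: one value per layer, KV head, dim, and token.

    Ignores per-block scales and alignment padding (simplifying assumption).
    """
    return n_layers * n_kv_heads * head_dim * n_ctx * bits_per_value // 8

# Hypothetical model dimensions, chosen only to make the arithmetic concrete.
n_layers, n_kv_heads, head_dim = 32, 8, 128
n_ctx = 128 * 1024  # 128k tokens

fp16_bytes = k_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, 16)
int8_bytes = k_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, 8)

print(f"FP16 K:  {fp16_bytes / GIB:.2f} GiB")                  # 8.00 GiB
print(f"8-bit K: {int8_bytes / GIB:.2f} GiB")                  # 4.00 GiB
print(f"saved:   {(fp16_bytes - int8_bytes) / GIB:.2f} GiB")   # 4.00 GiB
```

The saving scales linearly with context length and with the number of layers and KV heads, so the absolute number depends entirely on the model; the figure reported in the source applies to whatever configuration was measured there.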

Ranking rationale: This is a technical optimization for a specific software and hardware combination, not a new model release or a major industry event.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

r/LocalLLaMA TIER_1 English(EN) · /u/DrBearJ3w · 2026-05-31 10:51

Flash Attention for llama.cpp on RDNA3: 47% less KV VRAM than Vulkan f16 K, KLD almost lossless on F16 K / q4_0 V. Part 1.

The normal tradeoff in llama.cpp attention is: quantize your KV cache and lose quality, or keep fp16 and burn VRAM. On RDNA3 there's a third option (from now on)! Pack four 8-bit K values into a single 32-bit and feed them directly to the GPU's nat…
