llama-bench defaults corrected for flash attention and GPU layers

By PulseAugur Editorial · [1 source] · 2026-06-18 09:36

A recent llama.cpp build, b9437, corrects two default settings in the llama-bench tool: flash attention and the GPU layer count. Previously, the tool hard-coded flash attention off, even on capable hardware, and used a legacy sentinel value for the GPU layer count. The update defaults flash attention to automatic activation on capable backends (CUDA, Metal, Vulkan) and sets the GPU layer count to -1, matching other llama.cpp tools such as llama-server and llama-cli. As a result, benchmarks run with the new defaults reflect flash attention usage on supported GPUs.
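Because the defaults differ between builds, results from an older and a newer llama-bench binary are only comparable when both settings are passed explicitly. The following is a minimal Python sketch of that idea, not part of the original report. The helper name `build_bench_command` and the model path are hypothetical, and it assumes `-fa` accepts `0` or `1` as in earlier llama-bench builds; the accepted values in b9437 may differ, so check `llama-bench --help` for your build.

```python
import shlex
from typing import List


def build_bench_command(
    model_path: str,
    flash_attn: int = 1,
    gpu_layers: int = 99,
    binary: str = "llama-bench",
) -> List[str]:
    """Build a llama-bench argv that pins flash attention and GPU layers.

    Hypothetical helper: passing both flags explicitly avoids depending on
    whichever defaults a given build ships with.
    """
    # Assumption: -fa takes 0 or 1, as in earlier llama-bench builds.
    if flash_attn not in (0, 1):
        raise ValueError("flash_attn must be 0 or 1")
    return [
        binary,
        "-m", model_path,          # model file to benchmark
        "-fa", str(flash_attn),    # flash attention, set explicitly
        "-ngl", str(gpu_layers),   # number of layers offloaded to the GPU
        "-o", "json",              # machine-readable output
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_bench_command("model.gguf")))
```

The returned list can be passed to `subprocess.run` to run the same pinned configuration against binaries from before and after b9437.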

IMPACT Ensures accurate benchmarking of flash attention on compatible hardware, improving the reliability of performance metrics for llama.cpp.

RANK_REASON This is a fix to a specific tool's default settings, not a new model release or significant industry event.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

dev.to — LLM tag TIER_1 English(EN) · Creeta · 2026-06-18 09:36

llama-bench skipped FA on capable GPUs — b9437 corrects it

What flipped in b9437

Build b9437 (https://github.com/ggml-org/llama.cpp/releases), published on May 30, 2026 at 20:56 UTC, ships two targeted default-value corrections to llama-bench. Flash attention (-fa…
