Brief · PulseAugur

TOOL · r/LocalLLaMA (TL) · 4h

Pipeline parallelism in llama.cpp may be wasting your VRAM

A user reports that the default pipeline parallelism in llama.cpp may allocate extra VRAM without providing any speed benefit. Compiling llama.cpp with the flag -DGGML_SCHED_MAX_COPIES=1 avoids this allocation. The change is most relevant when all model layers are offloaded to the GPU.
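As a rough illustration of the build step described above, the sketch below prints the two CMake commands one might run from a llama.cpp checkout. Only -DGGML_SCHED_MAX_COPIES=1 comes from the post; the build directory name, the CUDA backend option (-DGGML_CUDA=ON), and the helper function names are assumptions for illustration and should be adjusted to your setup.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them,
# so they can be reviewed before building.
# Assumptions (not from the post): build dir "build", CUDA backend.

# Print the configure command with the scheduler limited to one copy.
configure_cmd() {
    printf 'cmake -B %s -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1\n' "${1:-build}"
}

# Print the build command for the same build directory.
build_cmd() {
    printf 'cmake --build %s --config Release -j\n' "${1:-build}"
}

configure_cmd build
build_cmd build
```

Run the printed commands from the root of the llama.cpp source tree. Comparing VRAM usage before and after the rebuild (for example with nvidia-smi on NVIDIA hardware) is a simple way to check whether the change matters for a given model and context size.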

IMPACT Users can reclaim VRAM by disabling default pipeline parallelism in llama.cpp, potentially allowing for larger models or contexts.

llama.cpp
VRAM
GGML_SCHED_MAX_COPIES=1