PulseAugur

LLaMA.cpp users seek VRAM optimization beyond tensor-split

A user on the r/LocalLLaMA subreddit is looking for a more efficient way to fill VRAM with llama.cpp when running Mixture of Experts (MoE) models across four GPUs. They report that `--fit` tends to place too much on a single GPU and trigger out-of-memory errors, so they set `-ngl 999` together with `--n-cpu-moe` and tune `--tensor-split` by hand, which is time-consuming and still leaves VRAM unused. They ask whether any technique beyond `--tensor-split` can use the available VRAM more fully, so that larger models load and run faster.
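As a rough illustration of the manual tuning described above, the sketch below derives a starting `--tensor-split` value from the free VRAM on each GPU. The helper name `tensor_split`, the 1024 MiB headroom, and the example VRAM figures are assumptions made for illustration; none of them come from the post. It also assumes that splitting in proportion to free memory is a sensible first guess, which ignores the KV cache, compute buffers, and uneven layer sizes, so the result would likely still need adjusting by hand.

```python
def tensor_split(free_mib, reserve_mib=1024):
    """Build a --tensor-split string proportional to usable VRAM per GPU.

    free_mib:    free VRAM per GPU in MiB (hypothetical values, e.g. read
                 from nvidia-smi by the caller).
    reserve_mib: headroom kept free on every GPU (an assumed default).
    """
    # Usable memory is whatever remains after the headroom, never negative.
    usable = [max(free - reserve_mib, 0) for free in free_mib]
    total = sum(usable)
    if total == 0:
        raise ValueError("no usable VRAM after reserving headroom")
    # --tensor-split takes a comma-separated list of proportions, one per GPU.
    return ",".join(f"{u / total:.2f}" for u in usable)


if __name__ == "__main__":
    # Four GPUs with made-up free-memory readings, in MiB.
    print(tensor_split([24000, 24000, 16000, 12000]))
```

The printed string could then be passed as the argument to `--tensor-split` alongside the flags the user already sets.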

IMPACT Users are exploring ways to maximize hardware efficiency for running large models locally.

RANK_REASON User discussion on optimizing existing tools, not a new release or significant development.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English (EN) · /u/GregoryfromtheHood

    Are there more easy techniques than --tensor-split to fill VRAM in llama.cpp?

    Using 4 GPUs with llama.cpp, with MoE models mainly, I try to fit as much in VRAM as I can. --fit does a terrible job and always causes oom by trying to put way too much on 1 gpu or stupid things like that, so I do --ngl 999 and --n-cpu-moe and a…