PulseAugur

LLaMA users debate Q4 vs Q5 quantization for 70B models on 24GB GPUs

A user on the r/LocalLLaMA subreddit is asking how to choose between Q4 and Q5 quantization for a 70-billion-parameter model when limited to 24GB of GPU memory. They are weighing the slight quality improvement of Q5 against the risk of exceeding the memory limit, particularly for code generation tasks, and want practical strategies from others who run large models locally.
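The memory side of this trade-off can be roughed out with simple arithmetic. The sketch below is not from the post: it assumes nominal widths of 4 and 5 bits per weight (real quantized formats store extra per-block metadata, so actual files are somewhat larger), treats 24GB as decimal gigabytes, and ignores the KV cache and activations unless an overhead is passed in. The function names `weight_bytes` and `fits` are hypothetical.

```python
# Back-of-envelope weight footprint. All constants below are illustrative
# assumptions, not measurements from the original post.

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed for the quantized weights alone (no KV cache, no activations)."""
    return n_params * bits_per_weight / 8


def fits(n_params: float, bits_per_weight: float, budget_bytes: float,
         overhead_bytes: float = 0.0) -> bool:
    """True if the weights plus an assumed fixed overhead fit within the budget."""
    return weight_bytes(n_params, bits_per_weight) + overhead_bytes <= budget_bytes


N_PARAMS = 70e9   # 70B parameters, as stated in the post
BUDGET = 24e9     # 24GB card, read as decimal gigabytes (assumption)

q4 = weight_bytes(N_PARAMS, 4.0)  # nominal 4 bits per weight (assumption)
q5 = weight_bytes(N_PARAMS, 5.0)  # nominal 5 bits per weight (assumption)
extra = q5 - q4                   # cost of the one extra bit per weight

if __name__ == "__main__":
    print(f"Q4 weights: {q4 / 1e9:.2f} GB")
    print(f"Q5 weights: {q5 / 1e9:.2f} GB")
    print(f"Q5 minus Q4: {extra / 1e9:.2f} GB")
    print(f"Q4 fully resident in budget: {fits(N_PARAMS, 4.0, BUDGET)}")
    print(f"Q5 fully resident in budget: {fits(N_PARAMS, 5.0, BUDGET)}")
```

Under these assumptions, each extra bit per weight costs one eighth of a byte per parameter regardless of the budget. At nominal widths the weights alone also come out larger than 24GB for both levels, so the "fits" described in the post presumably depends on setup details the truncated excerpt does not show.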

IMPACT Users debate practical trade-offs in running large local models, impacting hardware choices and performance expectations.

RANK_REASON User discussion on model quantization trade-offs.

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/Practical_Low29

    how do you decide between q4 and q5 on a 70b when 24gb is the cap?

    ran into the q4 vs q5 wall again this morning. 70b model. 24gb card. q4 fits with margin, q5 fits if i kill everything else on the gpu and pray.

    did the math on actual quality difference for my use case (mostly code generation on a private…