PulseAugur

User seeks cheapest hardware for fast 120B LLM inference

A user on the r/LocalLLaMA subreddit is looking for the most cost-effective hardware configuration to run a dense 120-billion-parameter large language model (LLM) at more than 10 tokens per second. The user wants fast responses for role-playing game campaigns, ideally with a 64,000-token context window and Q5 or Q6 quantization. They are weighing CPU-only, GPU-only, and mixed CPU/GPU inference setups, and note that GPU-based solutions require a large amount of VRAM.
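To make the memory requirement concrete, here is a rough back-of-the-envelope sketch in Python. It rests on assumptions that are not in the original post: the bits-per-weight figures (about 5.5 for Q5 and 6.5 for Q6) are approximations that vary by quantization variant, and the bandwidth figure uses the common rule of thumb that a dense model reads every weight once per generated token. The function names are hypothetical, and the KV cache for a 64,000-token context is not included because it depends on the model architecture.

```python
# Rough sizing sketch for a dense LLM. All constants below are assumptions
# chosen for illustration, not figures taken from the original post.

# Approximate effective bits per weight; real values differ by quant variant.
ASSUMED_BITS_PER_WEIGHT = {"Q5": 5.5, "Q6": 6.5}


def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def min_bandwidth_gb_s(size_gb: float, tokens_per_sec: float) -> float:
    """Lower bound on memory bandwidth, assuming every weight is read
    once per generated token (rule of thumb for dense models)."""
    return size_gb * tokens_per_sec


if __name__ == "__main__":
    params_billion = 120.0   # dense 120B model, as in the post
    target_tok_s = 10.0      # speed target from the post
    for quant, bpw in ASSUMED_BITS_PER_WEIGHT.items():
        size = weight_size_gb(params_billion, bpw)
        bandwidth = min_bandwidth_gb_s(size, target_tok_s)
        print(f"{quant}: ~{size:.1f} GB weights, "
              f">= {bandwidth:.0f} GB/s for {target_tok_s:.0f} tok/s")
```

Under these assumptions the weights alone come to roughly 82.5 GB at Q5 and 97.5 GB at Q6, so sustaining 10 tokens per second implies on the order of 825 to 975 GB/s of effective memory bandwidth before any KV cache or runtime overhead is counted.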

RANK_REASON This is a user question on a specific hardware setup for running LLMs locally, not a significant industry announcement or development.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/TrainingTwo1118 ·

    Cheapest setup for >10 tok/sec for 120B dense LLM

    Hi all, I'm trying to wrap my head around hardware variables when it comes to LLM, and I have another question: what would be the cheapest way to run a 120B dense LLM at >10 tok/sec? I'm fine with Q5, ideally Q6 though. …