Cheapest setup for >10 tok/sec for 120B dense LLM
A user on the r/LocalLLaMA subreddit is looking for the cheapest hardware configuration that can run a 120-billion-parameter dense large language model (LLM) at more than 10 tokens per second. The goal is fast responses for role-playing game campaigns, ideally with a 64,000-token context window and Q5 or Q6 quantization. The user is weighing CPU-only, GPU-only, and mixed CPU/GPU inference setups, and notes that GPU-only solutions need a large amount of VRAM.
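To put rough numbers on the memory requirement, here is a minimal back-of-the-envelope sketch in Python. It is not from the original post. The bits-per-weight figures for Q5 and Q6 are assumed approximations (real quantization formats vary by variant), and the helper names `weight_gb` and `min_bandwidth_gb_per_s` are hypothetical. It also relies on a common rule of thumb: when generating one token at a time, a dense model reads roughly all of its weights per token, so memory bandwidth sets an upper limit on speed. The KV cache for a 64,000-token context and any runtime overhead are ignored, so real requirements would be higher.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def min_bandwidth_gb_per_s(weights_gb: float, tokens_per_sec: float) -> float:
    """Rough lower bound on memory bandwidth for a dense model,
    assuming every weight is read once per generated token."""
    return weights_gb * tokens_per_sec


PARAMS_BILLION = 120   # model size from the post
TARGET_TPS = 10        # speed target from the post

# ASSUMPTION: approximate bits per weight; actual values depend on the
# specific quantization variant used.
ASSUMED_BITS_PER_WEIGHT = {"Q5": 5.5, "Q6": 6.5}

for name, bpw in ASSUMED_BITS_PER_WEIGHT.items():
    size = weight_gb(PARAMS_BILLION, bpw)
    bandwidth = min_bandwidth_gb_per_s(size, TARGET_TPS)
    print(f"{name}: ~{size:.1f} GB weights, ~{bandwidth:.0f} GB/s for {TARGET_TPS} tok/s")
```

Under these assumptions the weights alone take on the order of 80 to 100 GB, and sustaining 10 tokens per second would need memory bandwidth in the high hundreds of GB/s. That gap is why the CPU-only, GPU-only, and mixed options differ so much in cost.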