This article details a cost-effective way to run large language models locally on two used NVIDIA RTX 3090 graphics cards, which together provide 48GB of VRAM. According to the article, the setup runs inference on 70B-parameter models at 18-22 tokens per second, a rate it deems sufficient for interactive chat. The guide emphasizes that NVLink is unnecessary and that standard software such as Ollama or llama.cpp can manage the dual-GPU configuration effectively, and it provides specific instructions for each.
AI IMPACT Enables cost-effective local LLM inference for users with budget constraints.
RANK_REASON The article provides a practical guide for setting up consumer hardware for a specific AI task, rather than announcing a new model or research.
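The claim that a 70B model fits in 48GB of combined VRAM rests on some arithmetic the summary does not spell out. The sketch below is a rough illustration, not the article's method: the bits-per-weight values and the `reserve_gb` headroom (for KV cache and runtime buffers) are assumptions chosen for the example, and the function names are hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def fits(params_billion: float, bits_per_weight: float,
         total_vram_gb: float = 48.0, reserve_gb: float = 6.0) -> bool:
    """True if weights plus an assumed headroom fit in the combined VRAM.

    reserve_gb is a guessed allowance for KV cache and buffers, not a measured value.
    """
    return weight_memory_gb(params_billion, bits_per_weight) + reserve_gb <= total_vram_gb


if __name__ == "__main__":
    # 16 = half precision, 8 = 8-bit quantization, 4.5 = an assumed 4-bit-class quantization
    for bits in (16, 8, 4.5):
        size = weight_memory_gb(70, bits)
        print(f"70B at {bits} bits/weight: {size:.1f} GB weights, fits in 48 GB: {fits(70, bits)}")
```

Under these assumptions, only the 4-bit-class case leaves room on two 24GB cards, which suggests the reported setup relies on a quantized model.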
AI-generated summary · Google Gemini · from 1 source.