How much VRAM do you actually need to run Llama 3 or Gemma locally?
Running large language models such as Llama 3 and Gemma locally takes more VRAM than the model weights alone: the KV cache and runtime overhead also count. The KV cache, which stores the attention keys and values for every token in the context, grows linearly with context length and can exceed the memory needed for the weights at long context windows. For instance, Llama 3 8B at a 128K context requires a 24GB card, while Gemma 2 9B needs more VRAM than Llama 3 8B because of its larger KV cache, despite a similar parameter count.
AI IMPACT: Understanding VRAM requirements beyond model weights is critical for optimizing local LLM deployment and managing hardware costs.
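
The KV cache arithmetic behind these figures can be sketched in a few lines of Python. This is a rough estimate, not a measurement: the function name `kv_cache_bytes` is hypothetical, the layer counts, KV-head counts, and head dimensions are assumptions taken from the models' published configurations and should be checked against the checkpoint you actually run, and the cache is assumed to be stored in 16-bit precision with full attention on every layer (so it ignores sliding-window layers, cache quantization, and runtime overhead).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_elem=2):
    """Estimate KV cache size: one key and one value vector per layer, per KV head, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = keys + values
    return per_token * context_tokens


GIB = 2**30

# Assumed architecture values (verify against each model's config file).
MODELS = {
    "Llama 3 8B": {"n_layers": 32, "n_kv_heads": 8, "head_dim": 128},
    "Gemma 2 9B": {"n_layers": 42, "n_kv_heads": 8, "head_dim": 256},
}

if __name__ == "__main__":
    for name, cfg in MODELS.items():
        for ctx in (8_192, 131_072):
            size = kv_cache_bytes(context_tokens=ctx, **cfg)
            print(f"{name} @ {ctx} tokens: {size / GIB:.2f} GiB of KV cache")
```

Under these assumptions, Gemma 2 9B's larger head dimension and deeper stack give it a noticeably bigger cache per token than Llama 3 8B, which is consistent with the comparison above. Total VRAM is this figure plus the weights (which depend on quantization) plus framework overhead.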