A user detailed their experience running Google's new Gemma 4 12B model on an older GTX 1080 Ti GPU. They found that the Q4 quantization level achieved a usable speed of around 28 tokens/sec for chat and drafting, fitting in about 8GB of VRAM on a single card. However, for more detail-sensitive tasks like bioinformatics, the Q4 version produced visible glitches and factual errors. These were resolved by switching to the Q8 quantization level, at the cost of slower generation and a second GPU.
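The jump from one GPU to two follows from simple arithmetic on the weight size. Here is a rough sketch, assuming a nominal 12 billion parameters and the bits-per-weight of llama.cpp's Q4_0 and Q8_0 block layouts (4.5 and 8.5). The summary does not say which Q4/Q8 variants or which runtime the user ran, so these figures are illustrative only, and they exclude the KV cache and runtime overhead.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 12e9  # assumption: nominal "12B" parameter count
# Assumption: llama.cpp Q4_0 / Q8_0 layouts (32 weights plus one fp16 scale per block)
BITS_PER_WEIGHT = {"Q4": 4.5, "Q8": 8.5}

sizes = {name: weight_gib(N_PARAMS, bpw) for name, bpw in BITS_PER_WEIGHT.items()}

for name, gib in sizes.items():
    print(f"{name}: ~{gib:.1f} GiB of weights")
```

Under these assumptions the Q4 weights come to roughly 6.3 GiB and the Q8 weights to roughly 11.9 GiB, which is consistent with Q4 fitting on one card and Q8 spilling onto a second.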
IMPACT Demonstrates that newer, smaller models can run on older hardware for basic tasks, though accuracy-sensitive work may need a higher-precision quantization such as Q8.
RANK_REASON User-level evaluation of a new model on older hardware.
AI-generated summary · Google Gemini · from 1 source.