PulseAugur / Brief

last 24h
223 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. 120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

    A user on Reddit's r/LocalLLaMA reports reaching an inference speed of 120 tokens per second with Google's Gemma 4 12B model on a system with 12GB of VRAM. The run used a quantization-aware training (QAT) variant of the model in GGUF format, together with a patched version of llama.cpp and specific model files. The result shows that a large language model can run efficiently on local consumer hardware.

    IMPACT: Demonstrates efficient local LLM inference on consumer hardware, potentially lowering the hardware barrier for developers who run models locally.