PulseAugur

llama.cpp

PulseAugur coverage of llama.cpp — every cluster mentioning llama.cpp across labs, papers, and developer communities, ranked by signal.

Total · 30d: 286 (286 over 90d)
Releases · 30d: 0 (0 over 90d)
Papers · 30d: 13 (13 over 90d)
TIMELINE
  1. 2026-06-08 research_milestone llama.cpp merged a pull request to optimize KV cache performance for the Gemma-4 model.
  2. 2026-06-05 product_launch A pull request ports multi-column MMVQ from the CUDA backend to SYCL in llama.cpp, aiming to improve performance on Intel Arc GPUs.
  3. 2026-05-30 product_launch llama.cpp released version b9438, adding custom CSS injection for web UI theming.
  4. 2026-05-25 research_milestone A fix is expected for llama.cpp to address split mode tensor crashes.
  5. 2026-05-25 product_launch A pull request was submitted to improve checkpoint creation and context handling in llama.cpp.
  6. 2026-05-24 product_launch llama.cpp released version b9305 with pre-compiled binaries for multiple platforms.
  7. 2026-05-17 research_milestone llama.cpp implemented MTP optimizations and prompt decode improvements for faster local AI inference.
  8. 2026-05-14 product_launch A performance-optimized fork of llama.cpp was released with new features.
  9. 2026-05-12 product_launch The llama.cpp project integrated the llama-eval tool for model benchmarking.
SENTIMENT · 30D

31 days with sentiment data

RECENT · PAGE 4/10 · 200 TOTAL
  1. TOOL · CL_75531 ·

    Gemma 4 QAT MTP heads released, crash fix enables parallel processing

    The Gemma 4 QAT MTP assistant heads have been released on HuggingFace, offering improved performance for speculative decoding. These heads are specifically trained to match the quantized weights of the Gemma 4 models, s…

  2. COMMENTARY · CL_75329 ·

    LocalLLaMA users share 16GB VRAM LLM setups for coding

    Users on the r/LocalLLaMA subreddit are discussing optimal local large language model (LLM) deployments for hardware configurations featuring 16GB of VRAM and 64GB of RAM. The conversation focuses on identifying the bes…

  3. TOOL · CL_75292 ·

    AMD MI50 GPUs show strong performance with llama.cpp on Debian

    A user on Reddit's r/LocalLLaMA shared performance benchmarks for AMD MI50 GPUs running the llama.cpp software on Debian Testing. The benchmarks, conducted using the llama-benchy tool with the unsloth/Qwen3.6-35B-A3B-GG…

  4. TOOL · CL_75291 ·

    Gemma 4 12B model reaches 120 tokens/sec on 12GB VRAM

    A user on Reddit's r/LocalLLaMA subreddit has achieved 120 tokens per second inference speed with Google's Gemma 4 12B model. This was accomplished using a Quantization-Aware Training (QAT) variant of the model, specifi…

  5. TOOL · CL_75040 ·

    StepFun 3.7 Flash model achieves 27.5% faster token generation

    A user has benchmarked the StepFun Step-3.7-Flash model, a large language model with approximately 200 billion total parameters, on an AMD Ryzen AI Max+ 395 APU. The benchmark utilized a patched llama.cpp build with Vul…

  6. MEME · CL_75018 ·

    User seeks advice on dual-GPU setup for local LLM inference

    A user on the r/LocalLLaMA subreddit is seeking advice on configuring a dual-GPU setup for running large language models locally. They plan to combine a new NVIDIA RTX 3090 with their existing RTX 3060 in a ThinkStation…

  7. TOOL · CL_75017 ·

    Gemma 4 QAT models benchmarked on AMD Strix Halo APU

    A user benchmarked Google's Gemma 4 models, specifically the quantization-aware training (QAT) versions, on an AMD Strix Halo APU. The tests utilized llama.cpp with Vulkan/RADV backend to evaluate performance across dif…

  8. TOOL · CL_74972 ·

    Users seek MTP activation for Gemma4 31b model

    Users on the r/LocalLLaMA subreddit are discussing how to activate MTP (multi-token prediction, used for speculative decoding) for the new QAT Gemma4 31b model in q4_0 GGUF format. The primary question is whether this functional…

  9. MEME · CL_74928 ·

    LocalLLaMA users seek integrated TTS and image models for llama.cpp

    A user on the r/LocalLLaMA subreddit is inquiring about the availability of voice cloning and speech generation models that are compatible with inference engines like llama.cpp or vLLM-Omni. The goal is to integrate the…

  10. TOOL · CL_74745 ·

    Qwen 3.6 27B model performance drops with speculative decoding params

    A user on the r/LocalLLaMA subreddit is experiencing a significant drop in inference speed and GPU utilization when using the Qwen 3.6 27B model with specific parameters related to speculative decoding. When parameters …

  11. TOOL · CL_74606 ·

    DeepSeek V4 Flash model gains early support in llama.cpp

    A pull request is in progress to add support for the DeepSeek V4 Flash model to the llama.cpp library. While currently in an early, slow, and unstable stage, the model is praised for its intelligence relative to its siz…

  12. TOOL · CL_74363 ·

    Run Google's Gemma-4 12B model on WSL2 with llama.cpp

    A guide details how to run Google's Gemma-4 12B model on Windows Subsystem for Linux 2 (WSL2) using the llama.cpp framework. The process involves updating the WSL environment, installing necessary dependencies like buil…

  13. TOOL · CL_74101 ·

    NVIDIA launches Cosmos, MemPalace excels in AI memory, OpenClaw aids local agents

    NVIDIA has launched Cosmos, an open platform for developing physical AI world models, aiming to advance robotics and autonomous systems. Concurrently, MemPalace has emerged as a top-performing open-source system for AI …

  14. TOOL · CL_74011 ·

    Laptop GPU runs Qwen3.6 model with surprising speculative decoding boost

    A user detailed their experience running the Qwen3.6-35B-A3B model on a laptop with an 8GB RTX 4060 GPU. They found that disabling memory mapping (`--no-mmap`), ensuring sufficient VRAM headroom, and closing CPU-intensi…

  15. TOOL · CL_73891 ·

    llama.cpp SYCL backend gets multi-column MMVQ port for Intel Arc GPUs

    A pull request has been submitted to the llama.cpp project to port the multi-column MMVQ (quantized matrix-vector multiplication) kernels from the CUDA backend to SYCL. This port aims to improve performance for users with Intel Arc gr…

  16. TOOL · CL_73812 ·

    Gemma 4 12B model fixed for coding with special chat template

    Users on r/LocalLLaMA have discovered that the Gemma 4 model, particularly the 12B parameter version, has issues with tool calling and coding tasks. A specific chat template, available via a GitHub Gist, has been identi…

  17. SIGNIFICANT · CL_73706 ·

    Google releases Gemma 4 QAT checkpoints for faster on-device AI

    Google has released quantization-aware training (QAT) checkpoints for its Gemma 4 models, significantly reducing their memory footprint and increasing inference speed on consumer hardware. These new checkpoints allow fo…

  18. TOOL · CL_73723 ·

    iOS app GenBench enables on-device GGUF model benchmarking

    A new free iOS application called GenBench has been released, allowing users to download, run, and benchmark GGUF models directly on their iPhones and iPads. The app utilizes llama.cpp and Metal for offline operation an…

  19. TOOL · CL_73722 ·

    KV cache RAM offload offers viable alternative for local LLMs

    A user on r/LocalLLaMA explored the performance implications of offloading the KV cache to system RAM instead of VRAM when running large language models locally. By using the `-nkvo` flag in llama.cpp, the user found th…

  20. TOOL · CL_73591 ·

    InferBench app simplifies local LLM performance testing

    A new open-source desktop application called InferBench has been released to help users determine which large language models (LLMs) can run on their local GPUs and at what speed. The tool automates the process of downl…