llama.cpp
PulseAugur coverage of llama.cpp — every cluster mentioning llama.cpp across labs, papers, and developer communities, ranked by signal.
- 2026-06-08 research_milestone llama.cpp merged a pull request to optimize KV cache performance for the Gemma-4 model.
- 2026-06-05 product_launch A SYCL backend has been ported to llama.cpp, offering performance improvements for Intel Arc GPUs.
- 2026-05-30 product_launch llama.cpp released version b9438, adding custom CSS injection for web UI theming.
- 2026-05-25 research_milestone A fix is expected for llama.cpp to address split mode tensor crashes.
- 2026-05-25 product_launch A pull request was submitted to improve checkpoint creation and context handling in llama.cpp.
- 2026-05-24 product_launch llama.cpp released version b9305 with pre-compiled binaries for multiple platforms.
- 2026-05-17 research_milestone llama.cpp implements MTP optimizations and prompt decode improvements for faster local AI inference.
- 2026-05-14 product_launch A performance-optimized fork of llama.cpp was released with new features.
- 2026-05-12 product_launch llama.cpp project integrates llama-eval tool for model benchmarking.
-
LLM Inference Handbook Explains Token Generation and Optimization
This handbook delves into the engineering discipline of Large Language Model (LLM) inference, explaining how models generate tokens and the advanced optimization techniques used in production systems. It covers fundamen…
-
LLM inference throttles due to hidden VRAM overheating
Modern operating systems fail to report critical VRAM temperatures, instead showing the GPU core temperature, which can lead to performance degradation in local LLM inference. This telemetry gap is particularly problema…
-
Local LLM Guide Updated with Gemma 4 Speed Boosts and Diagram Tools
Thomas Bley has updated his "Run LLMs Locally" presentation with new examples and performance improvements. The update includes a demonstration of creating Mermaid diagrams within the llama.cpp UI and introduces Quantiz…
-
Luce Spark enables 35B MoE models on 16GB GPUs
Luce Spark is a new open-source system that enables large 35 billion parameter Mixture-of-Experts (MoE) models to run on a single 16 GB GPU. It achieves this by intelligently keeping only the currently active experts on…
-
Local LLM Speed Boosted by Gemma 4 MTP and QAT
A recent update to the "Run LLMs Locally" project has introduced Multi-Token-Prediction (MTP) for Gemma models, achieving speed improvements of up to 90% in token generation. This optimization, combined with Quantizatio…
-
llama.cpp pull request proposes video input support for local AI models
A pull request has been submitted to the llama.cpp project to integrate video input capabilities into the mtmd tool. This update would allow users to process and analyze video content using local large language models l…
-
llama.cpp optimizes KV cache for Gemma-4 performance
The llama.cpp project has merged a pull request that optimizes KV cache performance, specifically for the Gemma-4 model. This change, available in version b9551 and later, aims to reduce memory copies associated with KV…
-
Pakistan Notice Helper uses small AI to flag scam messages
A new AI tool called Pakistan Notice Helper has been developed to assist users in Pakistan in identifying potentially fraudulent messages. The tool analyzes text or screenshots, providing a risk label, explanation of re…
-
RTX 3090 causes Windows crashes when running AI models
A user on the r/LocalLLaMA subreddit is experiencing frequent Windows crashes when running AI models on their RTX 3090 graphics card. The crashes occur under heavy load, even when VRAM utilization is not a factor, and p…
-
New coding benchmark reveals agent limitations; Kimi launches desktop product
The AI news landscape saw significant developments in coding benchmarks and agent development. Cognition introduced FrontierCode, a new benchmark that evaluates code mergeability and maintainability, revealing that even…
-
llama.cpp updates SYCL compute runtime to v26.x in Docker
The llama.cpp project has released version b9554, which includes an update to its SYCL compute runtime to version 26.x within its Docker environment. This update also adds a comment detailing the old driver configuratio…
-
Developer runs LLM inference on Samsung Galaxy Z Fold6
A developer has created an Android application called Pocket Node that enables local inference of large language models on a Samsung Galaxy Z Fold6. The app utilizes llama.cpp with a Vulkan backend for efficient process…
-
Pi AI agent framework criticized for not supporting local LLMs
A Reddit user argues that the AI agent framework Pi, created by Mario Zechner, is not designed with local LLM users in mind. The user suggests Pi's focus on API users and its minimalist design, including a short system …
-
User seeks clarity on MTP and QAT ("QTA") for Gemma 4
A user on the r/LocalLLaMA subreddit is seeking clarification on the relationship between MTP (Multi-Token Prediction, a token-generation speed-up rather than a quantization method) and "QTA" (apparently QAT, Quantization Aware Training). They are confused by the rapid…
-
User seeks NVFP4 quantization guidance for llama.cpp
A user on the r/LocalLLaMA subreddit is seeking guidance on how to utilize NVFP4 quantization with the llama.cpp framework. They are particularly interested in converting NVFP4 safetensors to the GGUF format and whether…
-
User finds Qwen3.6 35B model capable for local AI tasks
A user shared their experience running the Qwen3.6 35B-A3B model locally on a laptop, finding it capable enough for personal tasks and brainstorming. This marks a significant shift for them, providing a "second brain" t…
-
Local LLM user questions RAM usage with Qwen 27B model
A user is experiencing unexpected RAM usage while running a large language model locally, despite expecting the context cache to be primarily handled by VRAM. They are using Qwen 27B with llama.cpp and a memory extensio…
-
Open-source tools simplify local LLM management with llama.cpp
Two developers have released open-source tools to simplify the use of llama.cpp, a popular framework for running large language models locally. One tool, llama-launcher, offers a point-and-click graphical interface for …
-
llama.cpp integrates Gemma 4 MTP for faster local model performance
The llama.cpp project has merged support for Gemma 4 MTP, a feature that enhances the speed and efficiency of local large language models. This integration allows users to leverage Gemma 4 with Quantization Aware Traini…
-
User seeks fix for Gemma 4 31B model repeating tokens
A user on the r/LocalLLaMA subreddit is seeking assistance with running the Gemma 4 31B QAT GGUF model. Despite successfully loading the main model and an MTP assistant head, the model consistently outputs repeated \u00…