llama.cpp
PulseAugur coverage of llama.cpp — every cluster mentioning llama.cpp across labs, papers, and developer communities, ranked by signal.
- 2026-06-08 research_milestone llama.cpp merged a pull request to optimize KV cache performance for the Gemma-4 model. source
- 2026-06-05 product_launch A SYCL backend has been ported to llama.cpp, offering performance improvements for Intel Arc GPUs. source
- 2026-05-30 product_launch llama.cpp released version b9438, adding custom CSS injection for web UI theming. source
- 2026-05-25 research_milestone A fix is expected for llama.cpp to address split mode tensor crashes. source
- 2026-05-25 product_launch A pull request was submitted to improve checkpoint creation and context handling in llama.cpp. source
- 2026-05-24 product_launch llama.cpp released version b9305 with pre-compiled binaries for multiple platforms. source
- 2026-05-17 research_milestone llama.cpp implements MTP optimizations and prompt decode improvements for faster local AI inference. source
- 2026-05-14 product_launch A performance-optimized fork of llama.cpp was released with new features. source
- 2026-05-12 product_launch llama.cpp project integrates llama-eval tool for model benchmarking. source
31 day(s) with sentiment data
-
llama.cpp releases b9445 with CI optimizations
The llama.cpp project released version b9445, which includes several changes to its continuous integration (CI) processes. These updates focus on optimizing build jobs by removing redundancies and improving efficiency a…
-
LLM VRAM Overflow: User Seeks Clarity on CPU vs. System Memory Optimization
A user on r/LocalLLaMA is seeking to understand how large language models, specifically the Unsloth Gemma 4 26B, utilize system memory when they exceed GPU VRAM capacity. They are experiencing performance issues and are…
-
User seeks to boost local LLM speed on high-end laptop
A user on the r/LocalLLaMA subreddit is seeking advice on how to improve the inference speed of their local large language model setup. Despite having a laptop with a powerful RTX 5070 Ti GPU (12GB VRAM), 32GB RAM, and …
-
Gemma4-26B beats Qwen3.6-35B in speed despite slower token output
A user compared the performance of Qwen3.6-35B and Gemma4-26B on a Radeon 7900 XTX GPU, finding that Gemma4-26B was approximately 20% faster in end-to-end task completion despite Qwen3.6-35B having a significantly faste…
-
llama.cpp RDNA3: Flash Attention cuts KV VRAM with packed 8-bit K
A new method for llama.cpp on RDNA3 GPUs significantly reduces KV cache VRAM usage by packing K values into 8-bit integers, which are then processed by the GPU's native `sudot4` instruction. This approach offers a VRAM …
-
User adds reasoning toggle to QWEN3.6 web chat
A user has developed a browser extension script for Tampermonkey that adds a "think" toggle button to the llama.cpp web chat interface. This functionality allows users to enable or disable the reasoning capabilities of …
-
Windows vs. Linux: No Speed Difference for llama.cpp MoE Models
A user tested the performance of llama.cpp on Windows 11 and Linux, finding no significant speed difference for medium to large Mixture of Experts (MoE) models. The tests involved specific hardware configurations and de…
-
Mudler releases Qwen3.6-35B model with Claude 4.7 Opus reasoning
A new quantized model, Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF, has been released by mudler. This model is based on the APEX (Adaptive Precision for Expert Models) quantization technique and in…
-
llama.cpp adds custom CSS theming to its web UI
The llama.cpp project has released version b9438, introducing a new feature that allows for custom CSS injection into its web UI. This update enables users to theme the interface by providing custom CSS through a config…
-
MacBook M5 vs RTX 4060 for Local AI Workflows
A user is seeking advice on whether a future MacBook M5 with 16GB, 24GB, or 32GB of unified memory would be a worthwhile addition to their existing setup. Their current machine features an RTX 4060 laptop GPU with 8GB o…
-
Run LLMs Locally with OpenAI-Compatible API
This guide demonstrates how to set up a large language model locally, making it accessible via an OpenAI-compatible API endpoint. The process involves using Ollama on an Apple Silicon Mac to serve models like `gpt-oss:2…
-
User seeks guidance on STT-LLM-TTS pipeline integration
A user on the r/LocalLLaMA subreddit is seeking guidance on building a pipeline that integrates speech-to-text (STT), a large language model (LLM), and text-to-speech (TTS). They are currently running Qwen 3.6 27B with …
-
Mini PC User Questions AI Performance of MINISFORUM UM790 Pro
A user on the r/LocalLLaMA subreddit is inquiring about the performance of the MINISFORUM UM790 Pro mini PC for running AI models like llama.cpp and vLLM. They reference a claim that this $351 device is a notable option…
-
llama.cpp b9426 fixes iGPU selection with RPC devices
The llama.cpp project has released version b9426, addressing an issue where integrated GPUs were incorrectly skipped when RPC devices were present. This fix ensures that local iGPUs are not overlooked, preventing potent…
-
LLaMA.cpp users seek VRAM optimization beyond tensor-split
A user on the r/LocalLLaMA subreddit is seeking more efficient methods for optimizing VRAM usage with llama.cpp, particularly for Mixture of Experts (MoE) models across multiple GPUs. They currently rely on manual adjus…
-
MTP boosts Gemma 4 and Qwen 3.6 inference speed by up to 3.34x
A user benchmarked Multi-Token Prediction (MTP) on Gemma 4 31B and Qwen 3.6 27B models using vLLM and llama.cpp, achieving up to a 3.34x inference speedup. The tests, conducted on an RTX 6000 PRO GPU, revealed that vLLM…
-
llama.cpp launches unified website and binary
The llama.cpp project has launched a new website at llama.app, consolidating its resources and providing a unified binary for its tools. This initiative aims to streamline access and usability for users working with the…
-
llama.cpp B9406 fixes MTP crash with MoE vision models
The llama.cpp project has released version B9406, which includes a fix for a crash related to MTP (multimodal processing) with MoE (mixture of experts) models and vision capabilities. This specific issue affected users …
-
Unsloth releases optimized Gemma 4 models for local use
Unsloth has released several quantized versions of the Gemma 4 model, optimized for efficient local execution. These models, including `gemma-4-12B-it-qat-GGUF` and `gemma-4-12b-it-GGUF`, are available on Hugging Face. …
-
llama.cpp PR optimizes VRAM usage with f16 mask
A pull request for the llama.cpp project introduces an f16 mask for FA (likely referring to Flash Attention or a similar optimization) to reduce VRAM usage. This change allows users to download and run larger models by …