llama.cpp
PulseAugur coverage of llama.cpp: every cluster mentioning llama.cpp across labs, papers, and developer communities, ranked by signal.
- 2026-06-08 research_milestone llama.cpp merged a pull request to optimize KV cache performance for the Gemma-4 model.
- 2026-06-05 product_launch A SYCL backend has been ported to llama.cpp, offering performance improvements for Intel Arc GPUs.
- 2026-05-30 product_launch llama.cpp released version b9438, adding custom CSS injection for web UI theming.
- 2026-05-25 research_milestone A fix is expected for llama.cpp to address split mode tensor crashes.
- 2026-05-25 product_launch A pull request was submitted to improve checkpoint creation and context handling in llama.cpp.
- 2026-05-24 product_launch llama.cpp released version b9305 with pre-compiled binaries for multiple platforms.
- 2026-05-17 research_milestone llama.cpp implements MTP optimizations and prompt decode improvements for faster local AI inference.
- 2026-05-14 product_launch A performance-optimized fork of llama.cpp was released with new features.
- 2026-05-12 product_launch llama.cpp project integrates llama-eval tool for model benchmarking.
-
MacBook users get guide for stable local AI model performance
A Reddit user shared a detailed guide for optimizing local AI model performance on MacBooks, particularly for the Qwen3.6 35B A3B model. The user experienced significant issues with crashes and slow performance before i…
-
llama.cpp adds Gemma4ForCausalLM architecture support
The llama.cpp project has released version b9341, which includes support for the Gemma4ForCausalLM architecture. This update allows for greater compatibility and integration with Google's Gemma models within the llama.c…
-
Macs struggle with LLM agent prompt processing, not just token speed
A discussion on Reddit's r/openclaw suggests that for agent-style workloads, prompt processing speed is a more critical bottleneck than tokens per second, especially when running models locally on Macs. While Macs with …
-
Lawyer builds 16-GPU AI cluster for legal drafting with MoE models
A lawyer has updated their local AI setup for legal drafting, now featuring twelve V100 SXM2 32GB GPUs and an additional box with four RTX 3090s and two V100 PCIe cards. They switched from vLLM to llama.cpp for running …
-
Local AI Tools Improve: llama.cpp Fix, NuExtract3 VLM, Qwen3.6 Speed
This week's AI news includes a critical fix for checkpoint creation in the llama.cpp server, enhancing its reliability for long-running agentic tasks. Additionally, NuExtract3 has been released as an open-weight 4B Visi…
-
llama.cpp adds CUDA FWHT for faster KV cache quantization
A pull request to the llama.cpp project introduces a CUDA implementation of the Fast Walsh-Hadamard Transform (FWHT). This optimization, developed by user am17an, aims to speed up operations when quantizing the key-valu…
-
llama.cpp split mode tensor fix to resolve multi-GPU crashes
A fix is reportedly incoming for the llama.cpp project to address crashes related to split mode tensor operations. This issue has been causing instability, particularly for users employing multiple GPUs, with tests show…
-
RTX 3060 users seek best coding LLM and setup
A user on the r/LocalLLaMA subreddit is seeking recommendations for the best coding-focused large language model that can run on hardware with 12GB of VRAM, specifically an RTX 3060. The user is also inquiring about opt…
-
Developer runs Anthropic's Claude Code locally for free using Qwen model
A developer successfully ran Anthropic's Claude Code locally for four hours, processing 7 million tokens without incurring API costs. This was achieved by routing Claude Code's requests through LiteLLM to a local Qwen3.…
-
Old Mac Pro repurposed for local LLM tasks with new drivers
An old Mac Pro, originally costing nearly £10,000, is being repurposed for local LLM work thanks to new Linux drivers that enable its D700 GPUs. The machine, equipped with 64GB of RAM and 24 cores, can now run models vi…
-
llama.cpp user reports persistent out-of-memory errors
A user on Reddit's r/LocalLLaMA subreddit is experiencing a persistent out-of-memory (OOM) issue with the llama.cpp software. The problem causes the process to consume increasing amounts of system RAM over 20-40 minutes…
-
llama.cpp update targets faster agentic coding by optimizing context handling
A pull request for the llama.cpp project aims to improve the responsiveness of agentic coding workflows. The proposed changes address issues where context rewriting by tools or models could force full prompt reprocessin…
-
llama.cpp releases b9309 with integer overflow fixes
The llama.cpp project has released version b9309, which includes fixes for integer overflow issues. This release is part of ongoing development and maintenance for the C/C++ implementation of Llama models.
-
LocalLLaMA user sees doubled inference speed with Qwen model after CPU parameter change
A user on Reddit's r/LocalLLaMA subreddit is seeking assistance understanding unexpected performance gains when running the Qwen3.6-35B-A3B-UD-Q4_K_XL model. They observed a doubling of inference speed, from 17 to 34 to…
-
hipEngine offers faster Qwen 3.6 LLM inference on AMD RDNA3 GPUs
A new open-source inference engine called hipEngine has been developed for AMD's RDNA3 GPUs, enabling faster native inference of the Qwen 3.6 large language model. The engine, written in Python with a HIP/C++ core, util…
-
Liquid AI ships LFM2.5-8B-A1B on-device MoE model
Liquid AI has released LFM2.5-8B-A1B, a new on-device Mixture-of-Experts (MoE) model designed for complex tasks and tool chaining. This model features 8.3 billion total parameters but activates only 1.5 billion per toke…
-
llama.cpp adds native tools, Qwen releases 35B GGUF model
The llama.cpp project has integrated native tools, including shell command execution and file editing, directly into its server, enabling local large language models to perform actions and automate tasks. This advanceme…
-
LocalLLaMA user seeks harness for multi-agent Qwen 3.6 setup
A user on Reddit's r/LocalLLaMA subreddit is seeking recommendations for an open-source harness to manage multiple local AI agents. They are currently using Qwen 3.5/3.6 27B models on a Windows 10 machine with an RTX 30…
-
LocalLLaMA user seeks VRAM optimization for smaller models
A user on the r/LocalLLaMA subreddit is seeking assistance with optimizing their GPU VRAM usage for running smaller language models. Despite successfully running larger models like Gemma4 26B and Qwen 3.6 35B MoEs, they…
-
Developer runs LLMs on $50 AMD RX 580 GPU using Vulkan
A developer demonstrated running large language models and image generation software on an older AMD RX 580 GPU with 8GB of VRAM, a feat previously thought impossible for such hardware. By leveraging the Vulkan backend …