GGUF
PulseAugur coverage of GGUF — every cluster mentioning GGUF across labs, papers, and developer communities, ranked by signal.
2 days with sentiment data
-
llama.cpp adds eval tool; MagicQuant v2.0 offers hybrid GGUF quants
The llama.cpp project has introduced llama-eval, a new tool for benchmarking local language models against standard datasets. Concurrently, MagicQuant v2.0 has released advanced hybrid GGUF quantization techniques, inte…
-
ExLlamaV3, Unsloth Qwen, and Phi3 agent see major local AI updates
This week's local AI news highlights significant updates to the ExLlamaV3 inference library, enhancing efficiency for running quantized Llama models on consumer GPUs. Additionally, new GGUF-quantized versions of Qwen 3.…
-
Local AI tools boost LLM speeds with new prediction and decoding techniques
Recent updates in the local AI community are enhancing inference speeds and providing practical benchmarks for open-weight models. The llama.cpp project now supports Multi-Token Prediction (MTP), which has shown a 40% s…
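Speedups like the one reported for Multi-Token Prediction can be sanity-checked with a back-of-envelope model borrowed from speculative decoding: if each verification step drafts `k` extra tokens and each draft token is accepted with probability `a` (acceptance stopping at the first rejection), the expected tokens emitted per forward pass is a geometric sum. This is a toy model, not llama.cpp's actual MTP implementation; `k` and `a` below are illustrative values.

```python
def expected_tokens_per_step(k_draft: int, accept_rate: float) -> float:
    # 1 guaranteed token per verification pass, plus each draft token i,
    # which survives only if all i drafts before it were accepted:
    # E = 1 + a + a^2 + ... + a^k
    return 1 + sum(accept_rate ** i for i in range(1, k_draft + 1))

# Illustrative: 2 draft tokens with 70% acceptance yields ~2.19 tokens
# per pass, i.e. roughly a 2x throughput gain if draft cost is negligible.
gain = expected_tokens_per_step(2, 0.7)
```

Under this model, a 40% speedup is plausible even with a single draft token and moderate acceptance rates.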
-
llama.cpp adds Sparse MoE support, Qwen3.6 GGUF, and WebWorld models for local AI
The llama.cpp project has been updated to support Xiaomi's MiMo-V2.5 Sparse MoE model, allowing local inference of large, parameter-efficient models. Additionally, a new uncensored Qwen3.6 27B model is now available in …
-
Ollama platform vulnerable to memory leaks via crafted GGUF files
A critical vulnerability, identified as CVE-2026-5757, has been discovered in the Ollama platform, potentially leading to memory leaks. The flaw is triggered by a specially crafted GGUF file. Security researcher Jeremy …
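Vulnerabilities triggered by crafted GGUF files are a reminder to validate the fixed-size header before trusting anything else in an untrusted file. The GGUF format opens with the 4-byte magic `GGUF`, a little-endian `uint32` version, then `uint64` tensor and metadata key-value counts. A minimal defensive check might look like the sketch below; the version set and sanity bounds are assumptions, not values from the Ollama advisory.

```python
import struct

GGUF_MAGIC = b"GGUF"
SUPPORTED_VERSIONS = {2, 3}   # assumption: versions your loader handles
MAX_TENSORS = 100_000         # hypothetical sanity bound
MAX_KV = 100_000              # hypothetical sanity bound

def check_gguf_header(data: bytes) -> tuple[int, int, int]:
    """Validate the fixed 24-byte GGUF header of an untrusted file.

    Returns (version, tensor_count, kv_count) or raises ValueError.
    """
    if len(data) < 24:
        raise ValueError("file too short for a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("bad magic: not a GGUF file")
    (version,) = struct.unpack_from("<I", data, 4)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported GGUF version {version}")
    tensor_count, kv_count = struct.unpack_from("<QQ", data, 8)
    if tensor_count > MAX_TENSORS or kv_count > MAX_KV:
        raise ValueError("implausible counts: possible crafted file")
    return version, tensor_count, kv_count
```

Bounding the tensor and key-value counts up front prevents a hostile header from driving huge allocations in the metadata-parsing loop that follows.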
-
IBM releases Apache 2.0 licensed Granite 4.1 LLMs in 3B, 8B, 30B sizes
IBM has released its Granite 4.1 family of large language models, available in 3B, 8B, and 30B parameter sizes under an Apache 2.0 license. Unsloth has further provided quantized GGUF variants of the 3B model, offering …
-
RadLite fine-tunes small LLMs for CPU-deployable radiology AI
Researchers have developed RadLite, a method for fine-tuning small language models (SLMs) with 3-4 billion parameters for radiology tasks. This approach, utilizing LoRA fine-tuning on models like Qwen2.5-3B-Instruct and…
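The appeal of LoRA for work like this is the parameter count: instead of updating a frozen `d_out x d_in` weight matrix, it trains two low-rank factors `B (d_out x r)` and `A (r x d_in)`. The arithmetic below uses hypothetical dimensions for illustration, not RadLite's published configuration.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    # LoRA trains only the low-rank factors B (d_out x r) and A (r x d_in),
    # so the trainable count per adapted matrix is r * (d_out + d_in).
    return rank * (d_out + d_in)

# Hypothetical: a 2048x2048 projection in a ~3B model, rank-16 adapter.
full_params = 2048 * 2048                       # 4,194,304 if fully tuned
lora_params = lora_trainable_params(2048, 2048, 16)  # 65,536 with LoRA
```

At rank 16 this trains roughly 1/64th of the matrix's parameters, which is what makes fine-tuning 3-4B models practical on modest hardware.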
-
SGLang AI inference server hit with critical CVE-2026-5760 vulnerability
A critical security vulnerability (CVE-2026-5760) with a severity score of 9.8 has been identified in SGLang, an AI inference server. The issue arises from a poisoned GGUF model file containing a chat-template that SGLa…
-
Stateful Transformers boost streaming inference; Intel releases AutoRound quantization toolkit
A new paper introduces a stateful transformer inference engine that significantly speeds up processing for streaming data by maintaining a persistent KV cache. This approach allows for query latency that is independent …
-
Quantized Qwen3.6-27B model achieves 100k context on 16GB VRAM
A user on Reddit's r/LocalLLaMA has detailed a method for running the Qwen3.6-27B model on a system with 16GB of VRAM, achieving a context length of 100,000 tokens. The process involves creating a custom GGUF quantizati…
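Whether a 100k-token context fits in 16GB comes down mostly to KV-cache size, which grows linearly with context length and shrinks proportionally when the cache is quantized. The sketch below estimates it; the layer/head/dimension numbers are hypothetical stand-ins for a ~27B GQA model, not Qwen3.6's published architecture.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elt: int) -> int:
    # Keys and values: 2 tensors per layer, each of shape
    # (ctx_len, n_kv_heads, head_dim), at bytes_per_elt per element.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt

# Hypothetical shape: 48 layers, 8 KV heads, head_dim 128, 100k context.
fp16_gib = kv_cache_bytes(48, 8, 128, 100_000, 2) / 2**30  # ~18.3 GiB
q8_gib = kv_cache_bytes(48, 8, 128, 100_000, 1) / 2**30    # ~9.2 GiB
```

Halving the cache to 8-bit (and further with 4-bit schemes) is what leaves room for the quantized weights alongside a 100k context on a 16GB card.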
-
Qwen3.6-27B model offers flagship coding performance in a smaller package
Qwen has released Qwen3.6-27B, an open-weight model that reportedly matches flagship-level coding performance. This new model significantly outperforms its predecessor, Qwen3.5-397B-A17B, while being substantially small…