GGUF
PulseAugur coverage of GGUF — every cluster mentioning GGUF across labs, papers, and developer communities, ranked by signal.
Sentiment data available for 7 days
-
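Node.js tutorial repository teaches the Model Context Protocol with a local AI agent
A new tutorial repository, "MCP from Scratch," has been released, offering a step-by-step guide to understanding the Model Context Protocol (MCP). The project focuses on building MCP servers in plain Node.js and integrating local GGUF model inference. It culminates in a custom agent loop that uses MCP tools, and includes an optional LangChain example.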
-
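New Windows app SEELS enables local LLM training from user corrections
SEELS, a new Windows desktop application designed for running local large language models (LLMs), has been released. Its core feature lets users correct model responses and use those corrections to train custom LoRA adapters, effectively personalizing the LLM. The app also includes a voice mode with local STT/TTS support and a hardware dashboard, supports GGUF models, and has more advanced features planned.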
-
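llama.cpp adds native tools; Qwen releases 35B GGUF model
The llama.cpp project has integrated native tools, including shell command execution and file editing, directly into its server, enabling local large language models to take actions and automate tasks. This development helps in building more autonomous agents that run entirely on local hardware. In addition, a new 35-billion-parameter Qwen model, Qwen3.6-35B-A3B, has been released in GGUF format, optimized for efficient local inference on consumer hardware.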
-
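Gemma4 Apex quantization boosts speed, Ollama context shrinks, Llama3 struggles with logical reasoning
Recent developments in local LLM deployment include a new Apex quantization technique for Gemma4 that achieves high token rates with large context windows, and a workflow that uses Memgraph to cut Ollama's prompt context by nearly 90%. In addition, benchmarks show that small models such as TinyLlama and Llama3.2:3b struggle with Boolean logic tasks, reaching roughly 50% accuracy.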
-
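LM Studio adds MTP speculative decoding to speed up local LLM inference
LM Studio has been updated to version 0.4.14 Build 2 (Beta), which integrates MTP speculative decoding to accelerate local large language model inference. The feature speeds up text generation by predicting multiple tokens at once, making local AI interactions smoother. In addition, new GGUF quantizations of the Qwen 3.6 35B model have been released, along with benchmark comparisons of MTP and NTP performance across different hardware, giving users data for optimizing local LLM deployments.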
-
llama.cpp boosts local AI with MTP and new coding model
The llama.cpp project has implemented significant optimizations, including Multi-Token Prediction (MTP) support and prompt decode improvements, to enhance local AI inference performance. These advancements allow for fa…
-
Q4_K_M recommended for local LLM quantization, balancing quality and VRAM
The article recommends Q4_K_M quantization as the best balance of quality and VRAM efficiency for most local LLM users, preserving 93-96% of FP16 quality. For users with more VRAM, Q5_K_M offers a noticeable improvement…
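The VRAM side of this trade-off can be approximated with simple arithmetic: weight size is roughly parameter count times bits per weight. The sketch below is illustrative only. The bits-per-weight figures are assumed ballpark values, not numbers from the article, and the 8B parameter count is a hypothetical example; real GGUF files also carry metadata and mix tensor types, and inference needs extra memory for the KV cache.

```python
# Rough weight-size estimate for different quantization levels.
# ASSUMPTION: these bits-per-weight values are approximate and illustrative;
# they do not come from the article summarized above.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}


def weight_size_gib(n_params: float, quant: str) -> float:
    """Return the approximate size of the model weights in GiB."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 2**30


if __name__ == "__main__":
    n_params = 8e9  # hypothetical 8B-parameter model
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: {weight_size_gib(n_params, quant):.1f} GiB")
```

Under these assumptions the quantized weights come to well under half the FP16 size, which is the headroom the recommendation relies on.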
-
Ollama guide shows how to run local GGUF models with GPU
This guide details how to run local GGUF models with Ollama, enabling GPU acceleration for improved performance. It covers installation, GPU detection for NVIDIA and AMD systems, and setting up a Modelfile for custom mo…
-
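llama.cpp adds evaluation tool; MagicQuant v2.0 offers hybrid GGUF quantization
The llama.cpp project has introduced llama-eval, a new tool for benchmarking local language models against standard datasets. Meanwhile, MagicQuant v2.0 has released advanced hybrid GGUF quantization techniques and integrates with Unsloth to optimize model compression. In addition, a new 26M-parameter open-source model called Needle has been released, designed for efficient local tool calling on consumer hardware.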
-
ExLlamaV3, Unsloth Qwen, and Phi3 agent see major local AI updates
This week's local AI news highlights significant updates to the ExLlamaV3 inference library, enhancing efficiency for running quantized Llama models on consumer GPUs. Additionally, new GGUF-quantized versions of Qwen 3.…
-
Local AI tools boost LLM speeds with new prediction and decoding techniques
Recent updates in the local AI community are enhancing inference speeds and providing practical benchmarks for open-weight models. The llama.cpp project now supports Multi-Token Prediction (MTP), which has shown a 40% s…
-
llama.cpp adds Sparse MoE support, Qwen3.6 GGUF, and WebWorld models for local AI
The llama.cpp project has been updated to support Xiaomi's MiMo-V2.5 Sparse MoE model, allowing local inference of large, parameter-efficient models. Additionally, a new uncensored Qwen3.6 27B model is now available in …
-
Ollama platform vulnerable to memory leaks via crafted GGUF files
A critical vulnerability, identified as CVE-2026-5757, has been discovered in the Ollama platform, potentially leading to memory leaks. The flaw is triggered by a specially crafted GGUF file. Security researcher Jeremy …
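As general background on why untrusted GGUF files deserve scrutiny, the sketch below reads the fixed-size GGUF header (magic, version, tensor count, metadata key-value count) and rejects implausible values before any further parsing. It is not a fix for, or a description of, the reported flaw, whose details are not given here. The layout is assumed to be the little-endian GGUF v2/v3 header, and the limits `MAX_TENSORS` and `MAX_KV` are hypothetical values picked for illustration.

```python
import struct

GGUF_MAGIC = b"GGUF"
# ASSUMPTION: little-endian GGUF v2/v3 header layout:
# 4-byte magic, uint32 version, uint64 tensor_count, uint64 metadata_kv_count.
HEADER = struct.Struct("<4sIQQ")

# Hypothetical sanity limits, chosen for illustration only.
MAX_TENSORS = 1_000_000
MAX_KV = 1_000_000


def check_gguf_header(data: bytes) -> dict:
    """Validate the fixed GGUF header and return its fields."""
    if len(data) < HEADER.size:
        raise ValueError("input too short for a GGUF header")
    magic, version, tensor_count, kv_count = HEADER.unpack_from(data)
    if magic != GGUF_MAGIC:
        raise ValueError("bad magic, not a GGUF file")
    if version not in (2, 3):
        raise ValueError(f"unsupported GGUF version: {version}")
    if tensor_count > MAX_TENSORS or kv_count > MAX_KV:
        raise ValueError("implausible tensor or metadata count")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

A caller would pass the first 24 bytes of the file, for example `check_gguf_header(open(path, "rb").read(24))`, and refuse to load the model if a `ValueError` is raised.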
-
IBM releases Apache 2.0 licensed Granite 4.1 LLMs in 3B, 8B, 30B sizes
IBM has released its Granite 4.1 family of large language models, available in 3B, 8B, and 30B parameter sizes under an Apache 2.0 license. Unsloth has further provided quantized GGUF variants of the 3B model, offering …
-
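RadLite fine-tunes small LLMs for CPU-deployable radiology AI
Researchers have developed RadLite, a fine-tuning approach for small language models (SLMs) of 3-4 billion parameters aimed at radiology tasks. The method applies LoRA fine-tuning to models such as Qwen2.5-3B-Instruct and Qwen3-4B and significantly improves performance across nine different radiology applications. The resulting models are small enough to be quantized and deployed on consumer CPUs, offering a practical option for resource-constrained clinical settings.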
-
SGLang AI inference server hit with critical CVE-2026-5760 vulnerability
A critical security vulnerability (CVE-2026-5760) with a severity score of 9.8 has been identified in SGLang, an AI inference server. The issue arises from a poisoned GGUF model file containing a chat-template that SGLa…
-
Stateful Transformers boost streaming inference; Intel releases AutoRound quantization toolkit
A new paper introduces a stateful transformer inference engine that significantly speeds up processing for streaming data by maintaining a persistent KV cache. This approach allows for query latency that is independent …
-
Quantized Qwen3.6-27B model achieves 100k context on 16GB VRAM
A user on Reddit's r/LocalLLaMA has detailed a method for running the Qwen3.6-27B model on a system with 16GB of VRAM, achieving a context length of 100,000 tokens. The process involves creating a custom GGUF quantizati…
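Long contexts are constrained largely by the KV cache, which for standard attention grows linearly with context length. The sketch below shows that arithmetic only. Every architecture number in it is hypothetical and is not the actual configuration of Qwen3.6-27B or the setup described in the post; models with hybrid or sliding-window attention need less than this formula suggests.

```python
def kv_cache_gib(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_ctx: int,
    bytes_per_element: float,
) -> float:
    """Approximate KV cache size in GiB for standard (full) attention.

    The leading factor of 2 accounts for storing both keys and values.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_element
    return total_bytes / 2**30


if __name__ == "__main__":
    # HYPOTHETICAL architecture values, for illustration only.
    layers, kv_heads, dim, ctx = 64, 8, 128, 100_000
    print(f"16-bit cache: {kv_cache_gib(layers, kv_heads, dim, ctx, 2.0):.1f} GiB")
    print(f"8-bit cache:  {kv_cache_gib(layers, kv_heads, dim, ctx, 1.0):.1f} GiB")
```

With these made-up numbers a 16-bit cache alone would exceed 16 GB, which is why setups like this usually combine weight quantization with a smaller or quantized cache.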
-
Qwen3.6-27B model offers flagship coding performance in a smaller package
Qwen has released Qwen3.6-27B, an open-weight model that reportedly matches flagship-level coding performance. This new model significantly outperforms its predecessor, Qwen3.5-397B-A17B, while being substantially small…