llama.cpp

PulseAugur coverage of llama.cpp — every cluster mentioning llama.cpp across labs, papers, and developer communities, ranked by signal.

Total · 30d: 382 (382 over 90d)
Releases · 30d: 0 (0 over 90d)
Papers · 30d: 14 (14 over 90d)
TIMELINE
  1. 2026-06-25 product_launch The llama.cpp project released version b9802 with pre-compiled binaries for multiple operating systems and hardware.
  2. 2026-06-25 product_launch llama.cpp version b9788 introduces tensor split support for Intel GPUs.
  3. 2026-06-17 product_launch llama.cpp has added API support for on-demand model management, including downloading and unloading models.
  4. 2026-06-08 research_milestone llama.cpp merged a pull request to optimize KV cache performance for the Gemma-4 model.
  5. 2026-06-05 product_launch A SYCL backend has been ported to llama.cpp, offering performance improvements for Intel Arc GPUs.
  6. 2026-05-30 product_launch llama.cpp released version b9438, adding custom CSS injection for web UI theming.
  7. 2026-05-25 research_milestone A fix is expected for llama.cpp crashes in tensor split mode.
  8. 2026-05-25 product_launch A pull request was submitted to improve checkpoint creation and context handling in llama.cpp.
  9. 2026-05-24 product_launch llama.cpp released version b9305 with pre-compiled binaries for multiple platforms.
  10. 2026-05-17 research_milestone llama.cpp implemented MTP optimizations and prompt decode improvements for faster local AI inference.
  11. 2026-05-14 product_launch A performance-optimized fork of llama.cpp was released with new features.
  12. 2026-05-12 product_launch The llama.cpp project integrated the llama-eval tool for model benchmarking.
SENTIMENT · 30D

30 days with sentiment data

RECENT · PAGE 1/10 · 200 TOTAL
  1. TOOL · CL_114852

    Ornith-1.0-35B GGUF model updated with speculative-decode graft

    The GGUF build of the Ornith-1.0-35B model has been updated with a native Multi Token Prediction (MTP) speculative-decode graft. This update enhances single-stream decode speeds by 1.3-1.35…

  2. TOOL · CL_114871

    User develops script to analyze llama.cpp memory usage

    A user has developed a script to monitor and analyze the memory usage of llama.cpp, a popular inference engine for large language models. This script parses the verbose output of llama.cpp to provide a clear summary of …

  3. TOOL · CL_114641

    llama.cpp integrates DFlash quantization for local LLM efficiency

    The llama.cpp project has integrated support for DFlash, a new quantization method. This integration, merged via a pull request, aims to improve the efficiency and performance of running large language models locally. T…

  4. TOOL · CL_114584

    Local LLM optimization: Step-3.7-Flash gains 2.4x speed, MTP breaks vision

    A developer has optimized the Step-3.7-Flash (198B-A11B vision MoE) model for local hardware, achieving significant performance gains. By ensuring the model's largest quantization (IQ3_XXS) fits entirely within the 96GB…

  5. TOOL · CL_114176

    Liquid AI ships tiny LFM2.5-230M for on-device agent tasks

    Liquid AI has released LFM2.5-230M, its smallest model to date, designed for on-device inference on edge hardware like phones and robots. This 230-million-parameter model excels at data extraction and tool use, outperfo…

  6. TOOL · CL_113871

    SpectralQuant method recovers 96.5% of BF16 performance gap in Qwen3.5 model

    Spectral Labs has developed a new quantization method called SpectralQuant, which aims to improve the performance of models with smaller footprints. Their initial release, a Qwen3.5 0.8B model quantized to Q4_K_M, reportedly …

  7. TOOL · CL_111954

    Ornith 1.0 models explained: Dense vs MoE and format/precision details

    A guide has been released to explain the terminology and concepts behind the new Ornith 1.0 models. The guide clarifies the difference between Dense and Mixture of Experts (MoE) architectures, noting that MoE models act…

  8. TOOL · CL_111217

    llama.cpp releases add MiniCPM5 support and performance enhancements

    The llama.cpp project has released several updates, including version b9833, which adds support for the MiniCPM5 model with an autoparser for XML tool calls and grammar fixes. Other releases focus on improving performance a…

  9. TOOL · CL_111065

    Developer creates C#-native Ollama replacement for LLM inference

    A developer has created a new inference server for Large Language Models (LLMs) entirely in C# using SpawnDev.ILGPU.ML. This server is designed to be a drop-in replacement for Ollama, supporting Ollama's API and reading…

  10. TOOL · CL_111032

    Graphics card prices surge, impacting local LLM setups

    The price of graphics cards suitable for running local large language models has significantly increased, prompting a user to seek advice on purchasing a second card. The user notes that their existing AMD RX 7900 XTX, …

  11. TOOL · CL_110784

    llama.cpp adds tensor split support for Intel GPUs, fixing model issues

    A recent release of llama.cpp, version b9788, introduces support for tensor splitting on Intel GPUs. This feature aims to resolve issues previously encountered when using tensor split mode, particularly with models like…

  12. TOOL · CL_110435

    New sampler-verifier system boosts small LLM coding performance

    A new research paper introduces a sampler and verifier system that significantly enhances the coding performance of small language models. This approach can potentially bring a 0.5 billion parameter model up to the leve…

  13. COMMENTARY · CL_110103

    MTP feature degrades output quality for Qwen 3.6 and Gemma 4 models

    A user on r/LocalLLaMA reported a significant decrease in output quality when using the MTP (Multi Token Prediction) feature with Qwen 3.6 and Gemma 4 models. Despite MTP offering higher token generation speeds, the user…

  14. TOOL · CL_110110

    User seeks help testing MTP for GLM-4.7-Flash model

    A user is seeking assistance in testing Multi Token Prediction (MTP) for the GLM-4.7-Flash model within the llama.cpp framework. They have developed a version of the model with MTP enabled and are looking for community …

  15. TOOL · CL_109811

    New App Enables Local, Offline Chat With Documents

    Off Grid AI Desktop is a new, free, open-source application designed to enable users to chat with their documents locally on their personal computers. The tool handles the entire process, including embedding, vector sto…

  16. TOOL · CL_109812

    Run Alibaba's Qwen LLM locally and offline with Off Grid AI Desktop

    Off Grid AI Desktop is a new, free, open-source application that allows users to run Alibaba Group's Qwen large language models locally on their personal computers. This enables offline, private AI interactions, with th…

  17. TOOL · CL_109813

    Run Google's Gemma LLM Locally with New Open-Source App

    A new open-source application called Off Grid AI Desktop allows users to run Google's Gemma language models locally on their Mac or Windows computers. This approach prioritizes user privacy by keeping all prompts and da…

  18. TOOL · CL_109816

    Run LLMs locally on Windows and Mac with Off Grid AI Desktop

    Off Grid AI Desktop is a new, free, open-source application that allows users to run large language models locally on their Windows PCs or Macs. The software supports offline use, eliminating the need for subscriptions …

  19. TOOL · CL_112135

    Unsloth releases Qwen-AgentWorld-35B model with broad integration support

    The unsloth/Qwen-AgentWorld-35B-A3B-GGUF model is now available on Hugging Face, offering users instructions for integration with various libraries and inference providers. The model can be utilized with tools such as T…

  20. TOOL · CL_110111

    GLM-5.2 speculative decode runs on 4x DGX GB10 cluster

    A user successfully implemented GLM-5.2 with MTP speculative decoding on a 4x DGX GB10 cluster, achieving approximately 9.4 tokens/second. This involved reconstructing missing build modifications from public kernels and…