PulseAugur

llama.cpp

PulseAugur coverage of llama.cpp — every cluster mentioning llama.cpp across labs, papers, and developer communities, ranked by signal.

Total · 30d: 286 (286 over 90d)
Releases · 30d: 0 (0 over 90d)
Papers · 30d: 13 (13 over 90d)
TIMELINE
  1. 2026-06-08 research_milestone llama.cpp merged a pull request to optimize KV cache performance for the Gemma 4 model.
  2. 2026-06-05 product_launch A SYCL backend has been ported to llama.cpp, offering performance improvements for Intel Arc GPUs.
  3. 2026-05-30 product_launch llama.cpp released version b9438, adding custom CSS injection for web UI theming.
  4. 2026-05-25 research_milestone A fix is expected for llama.cpp to address crashes in tensor split mode.
  5. 2026-05-25 product_launch A pull request was submitted to improve checkpoint creation and context handling in llama.cpp.
  6. 2026-05-24 product_launch llama.cpp released version b9305 with pre-compiled binaries for multiple platforms.
  7. 2026-05-17 research_milestone llama.cpp implements multi-token prediction (MTP) optimizations and prompt decode improvements for faster local AI inference.
  8. 2026-05-14 product_launch A performance-optimized fork of llama.cpp was released with new features.
  9. 2026-05-12 product_launch The llama.cpp project integrates the llama-eval tool for model benchmarking.
SENTIMENT · 30D: 31 days with sentiment data

RECENT · PAGE 6/10 · 200 TOTAL
  1. TOOL · CL_69470

    llama.cpp users share benchmarks for optimized Qwen3.6/3.5-MTP models

    The llama.cpp project has seen significant optimizations and fixes for the Qwen3.6/3.5-MTP models, with recent merges enhancing performance. Users are encouraged to share their benchmarks using the latest version, provi…

  2. FRONTIER RELEASE · CL_69458

    Google DeepMind releases multimodal Gemma 4 12B for laptops

    Google DeepMind has released Gemma 4 12B, an open-source multimodal AI model capable of processing text, images, audio, and video natively. This model is designed to run on consumer laptops with as little as 16 GB of RA…

  3. TOOL · CL_69399

    User seeks audio/vision integration help for llama.cpp

    A user on the r/LocalLLaMA subreddit is seeking guidance on integrating audio and vision capabilities into the llama.cpp framework. They are using the b9494 release and have encountered issues where the command-line int…

  4. TOOL · CL_69337

    llama.cpp PR optimizes Qwen35 inference speed

    A pull request has been submitted to the llama.cpp repository to optimize inference for the Qwen35 model. The proposed change uses the post-norm hidden state for the MTP (multi-token prediction) step. This modification aims …

  5. TOOL · CL_69109

    Google's Gemma 4 Unified Model Hints at Transformer-less Architecture

    A new model type, "Gemma 4 Unified," appears to be in development by Google, as indicated by code merged into the llama.cpp repository. The implementation details suggest a transformer-less vision tower architecture, a …

  6. TOOL · CL_69045

    llama.cpp adds Mermaid diagram generation with interactive preview

    A pull request has been submitted to the llama.cpp project, introducing the capability to generate Mermaid diagrams within the chat interface. This feature includes an interactive preview, allowing users to visualize an…

  7. TOOL · CL_69048

    Tauri v2 desktop app shells local LLMs via Ollama, llama.cpp

    A developer has created a desktop chat application using Tauri v2, designed to interface with local large language models. This application supports various backends, including Ollama, llama.cpp, and any endpoint compat…

  8. TOOL · CL_68787

    Engineer runs 35B LLM on old GPU, surprising many

    A software engineer demonstrated that a 35-billion-parameter language model can run effectively on older, consumer-grade GPUs. This was achieved through optimization techniques such as quantization, which reduces …

  9. TOOL · CL_68939

    llama.cpp tensor split mode causes CUDA error with Qwen model

    A user encountered a CUDA error when loading a Qwen3.6-27B model with tensor split mode enabled in the latest version of llama.cpp. The error message indicates that the `llama_params_fit` function is not imp…

  10. TOOL · CL_68680

    Mellum & Granite embedding models now available on llama.cpp

    The Mellum and Granite embedding models are now compatible with the llama.cpp framework. This integration allows users to leverage these models for local inference and development. The compatibility was achieved through…

  11. TOOL · CL_68678

    llama.cpp build b9455 achieves 70+ tokens/sec on Qwen3.6-27B

    A user in Reddit's r/LocalLLaMA community reported performance gains with a new build of llama.cpp, specifically version b9455. This updated version, when combined with tensor splitting across two RTX 3090 GPU…

  12. TOOL · CL_68067

    Ollama releases rapid updates, adding CLI, Qwen integration, and fixing Gemma crash

    Ollama has released several updates in quick succession, with versions v0.30.2 through v0.30.5 rolling out. These updates include the addition of Cline CLI support, improved logging for troubleshooting, and integration …

  13. MEME · CL_68046

    User seeks llama.cpp commands for NVFP4 model quantization

    A user on the r/LocalLLaMA subreddit is seeking guidance on how to quantize a large language model to the NVFP4 format using the llama.cpp tool. They are specifically interested in running the MiniMax M2.7 model but can…

  14. COMMENTARY · CL_67983

    Macs vs. NVIDIA GPUs: Choosing the Right Hardware for Local LLMs

    For running large language models locally, Apple Silicon Macs and NVIDIA GPUs offer distinct advantages. Macs excel at inference for larger models due to their unified memory architecture, allowing them to handle models…

  15. TOOL · CL_67826

    Ollama v0.30.1 fixes SSE parsing for llama.cpp streams

    Ollama has released version 0.30.1, which addresses an issue where the system incorrectly parsed non-data Server-Sent Events (SSE) as JSON. This update specifically targets the llama.cpp integration, preventing the syst…

  16. RESEARCH · CL_67687

    Hugging Face Spotlights Open-Source AI Models and llama.cpp Features

    Hugging Face is highlighting two open-source AI projects. The first is Codex, which is open-sourcing its AI models, with details available on Hugging Face's blog about AI skills training. The second project featured is …

  17. MEME · CL_67635

    NixOS flake build fixed, enabling llama.cpp compilation

    A user on Reddit expressed gratitude for a fix to the NixOS flake build, specifically noting that it now works for building llama.cpp. The post serves as a thank you to the individual or team responsible for the resolution.

  18. TOOL · CL_67339

    Gemma 4 E4B achieves 2.4x speedup with LiteRT engine

    A user has achieved a 2.4x speedup in text generation using Google's Gemma 4 E4B model by employing the LiteRT engine with multi-token prediction (MTP). This optimization significantly outperforms the standard Q4 GGUF q…

  19. TOOL · CL_67162

    StepFun 3.5 MTP model integrated into llama.cpp

    A new model called StepFun 3.5 MTP has been introduced via a pull request to the llama.cpp project. This model appears to be a successor to Gemma MTP, with its integration into llama.cpp being a key development.

  20. TOOL · CL_66953

    llama.cpp adds user control over AI reasoning effort

    A new pull request for the llama.cpp project introduces a "Thinking mode" toggle, allowing users to enable, disable, or limit the reasoning effort of the AI. This feature aims to provide more control over the model's co…