PulseAugur

llama.cpp

PulseAugur coverage of llama.cpp — every cluster mentioning llama.cpp across labs, papers, and developer communities, ranked by signal.

Total · 30d: 287 (287 over 90d)
Releases · 30d: 0 (0 over 90d)
Papers · 30d: 13 (13 over 90d)
TIMELINE
  1. 2026-06-08 research_milestone llama.cpp merged a pull request to optimize KV cache performance for the Gemma-4 model.
  2. 2026-06-05 product_launch A SYCL backend has been ported to llama.cpp, offering performance improvements for Intel Arc GPUs.
  3. 2026-05-30 product_launch llama.cpp released version b9438, adding custom CSS injection for web UI theming.
  4. 2026-05-25 research_milestone A fix addressing split-mode tensor crashes in llama.cpp is expected.
  5. 2026-05-25 product_launch A pull request was submitted to improve checkpoint creation and context handling in llama.cpp.
  6. 2026-05-24 product_launch llama.cpp released version b9305 with pre-compiled binaries for multiple platforms.
  7. 2026-05-17 research_milestone llama.cpp implemented MTP optimizations and prompt decode improvements for faster local AI inference.
  8. 2026-05-14 product_launch A performance-optimized fork of llama.cpp was released with new features.
  9. 2026-05-12 product_launch The llama.cpp project integrated the llama-eval tool for model benchmarking.
SENTIMENT · 30D

31 days with sentiment data

RECENT · PAGE 7/10 · 200 TOTAL
  1. TOOL · CL_66953 ·

    llama.cpp adds user control over AI reasoning effort

    A new pull request for the llama.cpp project introduces a "Thinking mode" toggle, allowing users to enable, disable, or limit the reasoning effort of the AI. This feature aims to provide more control over the model's co…

  2. TOOL · CL_66918 ·

    Nous Research launches integrated AI agent framework and models

    Nous Research has released Hermes Agent, a Python-based AI agent framework, alongside their Hermes Models, which are fine-tuned for function calling. This integrated approach, where the model is trained on the agent's t…

  3. TOOL · CL_66818 ·

    User optimizes Qwen3.6-27B LLM to 73 tokens/sec with llama.cpp

    A user details how they optimized the Qwen3.6-27B large language model to achieve a generation speed of 73 tokens per second using the llama.cpp framework. The article focuses on specific parameters and settings that pr…

  4. TOOL · CL_66851 ·

    Jetson Orin Nano benchmarks 8 tiny LLMs across power modes

    A benchmark of eight small language models (135M to ~1B parameters) was conducted on a Jetson Orin Nano Super 8GB device. The tests explored four power modes (7W, 15W, 25W, MAXN) using the llama.cpp CUDA backend. The fi…

  5. COMMENTARY · CL_66564 ·

    LLM psychology explored through five probing questions

    A series of posts explores the psychology of Large Language Models (LLMs) by posing five key questions. These questions delve into the LLM's inner workings, capabilities, limitations, and potential biases, drawing paral…

  6. TOOL · CL_66427 ·

    llama.cpp adds support for Step3.7-Flash model

    A pull request has been submitted to the llama.cpp repository to add support for the Step3.7-Flash model. This integration aims to enable local execution of this particular AI model. The request also mentions ongoing wo…

  7. TOOL · CL_66426 ·

    Qwen 3.6-35B-A3B model achieves 977 tok/s prompt processing on Intel Arc GPU

    A user ran the Qwen 3.6-35B-A3B model on an Intel Arc Pro B70 GPU using llama.cpp with the SYCL backend, yielding a prompt processing speed of 977 tok…

  8. TOOL · CL_65049 ·

    Intel Arc Pro B70 GPU benchmarks released for llama.cpp

    Benchmarks for Intel's Arc Pro B70 GPU running llama.cpp have been posted, showing performance metrics for the hardware with the Qwen model. The results indicate specific timings for the GPU's operation within the Llama…

  9. TOOL · CL_65009 ·

    MLX, LiteRT-LM, and CoreML benchmarked for iPhone LLM performance

    A recent benchmark tested four on-device LLM runtimes on an iPhone 17 Pro, comparing decode speed and memory usage. MLX emerged as the fastest for general-purpose models like Qwen 3.5 2B, while LiteRT-LM excelled specif…

  10. TOOL · CL_64952 ·

    Meta's AI data collection sparks employee backlash; NVIDIA boosts AI tool speeds

    Meta's AI training initiative, "MCI," is facing backlash from employees concerned about data collection and potential privacy violations. Concurrently, NVIDIA has announced significant performance improvements, doubling…

  11. TOOL · CL_64910 ·

    Claude Code runs locally on MacBook, outperforming llama.cpp

    A user ran Anthropic's Claude Code on their MacBook using the vllm-mlx library. This setup outperformed llama.cpp, with an 87% improvement in performance. The author expressed surprise at…

  12. TOOL · CL_64757 ·

    Odysseus launches as privacy-focused, self-hosted AI workspace

    Odysseus is a self-hosted AI workspace emphasizing local-first operation and user privacy. It integrates various functionalities including chat, agents, a cookbook for model management, deep research tools, model compar…

  13. SIGNIFICANT · CL_64680 ·

    FTC probes Microsoft AI/Azure; Ollama releases v0.30.0

    The US Federal Trade Commission (FTC) is investigating Microsoft for potential monopolistic practices within the AI and Azure cloud sectors. This probe, ongoing since 2024, could lead to legal action, echoing Microsoft'…

  14. COMMENTARY · CL_64561 ·

    LocalLLaMA users seek agentic browser use with local LLMs

    A user on the r/LocalLLaMA subreddit is seeking methods for enabling agentic browser use with local large language models. They are currently relying on cloud-based models for this functionality but are looking for alte…

  15. RESEARCH · CL_64527 ·

    JetBrains ships Mellum2, Heretic tool aids censorship removal, NVIDIA launches Cosmos 3

    JetBrains has released Mellum2, a 12-billion parameter Mixture-of-Experts model optimized for efficient local AI inference. Concurrently, a new tool called 'Heretic' has emerged on GitHub, designed to automatically remo…

  16. COMMENTARY · CL_64560 ·

    Qwen 3.6 27B model outperforms Gemini Pro in local testing

    A user shared their positive experience running the Qwen 3.6 27B model locally, finding it superior to Gemini Pro for complex research tasks. The model demonstrated impressive performance in analyzing official documenta…

  17. TOOL · CL_64402 ·

    llama.cpp merges KV cache fix for multi-GPU tensor operations

    The llama.cpp project has merged a significant fix (b9455) that resolves issues with the KV cache when using the --sm tensor flag on multi-GPU setups. This update, developed by Johannes Gaessler, ensures that shape info…

  18. TOOL · CL_63981 ·

    llama.cpp PR optimizes VRAM by limiting context outputs

    A pull request to the llama.cpp project aims to optimize VRAM usage by limiting the maximum output of `llama_context`. This change, building on a previous PR, reserves logits space only when necessary, potentially savin…

  19. RESEARCH · CL_63787 ·

    Mistral.rs boosts CUDA inference speed; non-CUDA status debated

    The mistral.rs project has released version 0.8.2, with CUDA inference speeds up to 2.8 times those of llama.cpp on various NVIDIA GPUs. This update focuses on optimizing throughput for models l…

  20. TOOL · CL_62444 ·

    Local Gemma models achieve 2.5x speedup with LiteRT endpoint

    A user has successfully integrated Google's Gemma 2B and 4B models into a local setup, achieving significantly faster performance than API-based models. This was accomplished by wrapping the LiteRT engine, designed for …