llama-server
PulseAugur coverage of llama-server — every cluster mentioning llama-server across labs, papers, and developer communities, ranked by signal.
-
llama.cpp web UI fails after recompilation, CLI and server functional
A user is experiencing issues with the llama-server web UI not responding to prompts, although the command-line interface and server itself appear to be functioning correctly. The web UI loads and can even load models, …
-
Empero AI releases Qwythos-9B reasoning model with 1M context window
empero-ai/Qwythos-9B-Claude-Mythos-5-1M, a 9B-parameter reasoning model, has been released on Hugging Face. The model is built on Qwen3.5-9B and fine-tuned with Claude Mythos and Fable trac…
-
llama-bench defaults corrected for flash attention and GPU layers
Build b9437 of the llama-bench tool corrects the default settings for flash attention and GPU layer counts. Previously, the tool hard-coded flash attention off, even on compatible hardware, and used …
-
Deo image-to-prompt tool adds LMStudio, Llama Server support
Deo has released version 1.1, enhancing its capabilities as an image-to-prompt generator. This update introduces experimental support for LMStudio and Llama Server, alongside improvements to prompt accuracy and quality …
-
Unsloth Releases 0.1.461-beta with GGUF Vision Fixes
Unsloth has released version 0.1.461-beta, which includes several fixes related to the local GGUF vision functionality within its studio environment. These updates aim to improve how the system handles GGUF files, parti…
-
Hyperparameter search yields minor gains for speculative decoding
A user on the r/LocalLLaMA subreddit shared their experience with hyperparameter tuning for speculative decoding, specifically using the "draft-mtp" method with the Qwen3.6 27B model on a Strix Halo platform. Despi…
-
llama-server router allocates CUDA context on all GPUs, causing OOM errors
A user on the r/LocalLLaMA subreddit is encountering an issue with the llama-server router mode where each model instance, even when pinned to a specific GPU, allocates a CUDA context on all available GPUs. This behavio…
-
Open-source tools simplify local LLM management with llama.cpp
Two developers have released open-source tools to simplify the use of llama.cpp, a popular framework for running large language models locally. One tool, llama-launcher, offers a point-and-click graphical interface for …
-
LocalLLaMA users seek portable voice interface for local AI models
A user on the r/LocalLLaMA subreddit is seeking information about existing portable devices that can connect to local language models for speech-to-text and text-to-speech interaction. The ideal device would be a small,…
-
Qwen3.6 model halts mid-response when used with OpenCode
A user on Reddit's r/LocalLLaMA forum is experiencing an issue with the Qwen3.6-27B model when used with OpenCode and llama-server for AI coding. The model sometimes stops generating responses mid-completion, requiring …
-
LlamaStash benchmarks show no overhead vs. llama-server, beats Ollama
LlamaStash, a new wrapper for running local LLMs, has been benchmarked against Ollama and LM Studio, demonstrating comparable or superior performance. The wrapper adds no measurable overhead compared to running llama-se…
-
Qwen3.6-27B-MTP-pi-tune-GGUF model now available for diverse AI tools
The bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF model is now available for use with various popular AI tools and libraries. Instructions are provided for integrating it with llama-cpp-python, llama.cpp, vLLM, Ollama, and Unslot…
-
Ollama v0.30.0-rc32 improves multi-GPU support and embeddings API
Ollama has published release candidate v0.30.0-rc32, which includes several follow-up fixes and improvements for its llama-server functionality. These updates address issues with ROCm build flags for multi-GPU …
-
LocalLLaMA user seeks llama-swap concurrent request fix
A user on the r/LocalLLaMA subreddit is seeking assistance with configuring llama-swap to handle concurrent requests for a single model. They have successfully set up Qwen 3.6 35B A3B with multi-GPU support and concurre…
-
User seeks help optimizing MTP in llama.cpp server
A user on Reddit is seeking help with the "draft-mtp" (multi-token prediction) feature in the llama.cpp server. They have downloaded a specific model, Qwen3.6-35B-A3B-MTP-GGUF, and are attempting to run…
-
Local LLMs Match Claude Haiku Quality, Fall Short on Sonnet Rewrites
A technical blog post benchmarks the Claude Agent SDK's performance when using local LLMs, specifically Qwen models, against Anthropic's Haiku and Sonnet tiers. The evaluation found that a local 35B model can match or e…
-
LocalLLaMA users seek MTP integration for llama-bench
Users on the r/LocalLLaMA subreddit are seeking a solution to integrate llama-bench with MTP, as standard methods that work with llama-server are failing. The core issue appears to be compatibility, with speculation tha…
-
LocalLLaMA users discuss preferred frontends for local LLMs
Users on the r/LocalLLaMA subreddit are discussing their preferred frontends for interacting with local large language models. One user shared their unconventional setup using Vim with a custom text completion plugin, w…
-
Quantized Qwen3.6-27B model achieves 100k context on 16GB VRAM
A user on Reddit's r/LocalLLaMA has detailed a method for running the Qwen3.6-27B model on a system with 16GB of VRAM, achieving a context length of 100,000 tokens. The process involves creating a custom GGUF quantizati…
-
Qwen3.6-27B model offers flagship coding performance in a smaller package
Qwen has released Qwen3.6-27B, an open-weight model that reportedly matches flagship-level coding performance. This new model significantly outperforms its predecessor, Qwen3.5-397B-A17B, while being substantially small…