vLLM
PulseAugur coverage of vLLM: every cluster mentioning vLLM across labs, papers, and developer communities, ranked by signal.
- used by Nexus Labs 90%
- used by H.1000 Gnome 80%
- used by graphics processing unit 70%
- used by MLX 70%
- used by llama-cpp-python 70%
- used by Gemma 4 70%
- used by Gemma 4:12b 70%
- used by LM Studio 70%
- used by FP8 70%
- competes with Text Generation Inference 70%
- uses Anyscale, Inc. 70%
- used by Qwen-3.6-27b 70%
- 2026-06-04 product_launch vLLM released version 0.22.1, including a fix for DeepSeek-V4 initialization compatibility.
- 2026-05-29 product_launch vLLM merged a pull request for a new HIP W4A16 kernel, enhancing performance.
- 2026-05-28 product_launch vLLM released version 0.22.0rc3.
- 2026-05-26 product_launch Nexus Labs implemented and tested vLLM's prefix caching feature, observing significant latency improvements for AI agents.
- 2026-05-15 product_launch vLLM released version 0.21.1rc0.
-
AI inference verification achieved with bit-exact precision
Researchers have developed a method to verify AI inference results with bit-exact precision, overcoming the challenge posed by non-deterministic GPU arithmetic. Their approach analyzes accumulated rounding errors as an …
-
Odysseus launches as privacy-focused, self-hosted AI workspace
Odysseus is a self-hosted AI workspace emphasizing local-first operation and user privacy. It integrates various functionalities including chat, agents, a cookbook for model management, deep research tools, model compar…
-
JetBrains ships Mellum2, Heretic tool aids censorship removal, NVIDIA launches Cosmos 3
JetBrains has released Mellum2, a 12-billion parameter Mixture-of-Experts model optimized for efficient local AI inference. Concurrently, a new tool called 'Heretic' has emerged on GitHub, designed to automatically remo…
-
AWS cuts LLM load times with GPUDirect Storage and FSx
AWS has introduced a new method to significantly speed up the loading of large language models onto GPU instances. By leveraging NVIDIA GPUDirect Storage (GDS) with Amazon FSx for Lustre, model weights can be loaded dir…
-
Developers need fine-tuned small language models for production
Fine-tuning small language models is becoming a crucial production workflow for developers dealing with high-volume, repetitive tasks. This approach offers lower latency, predictable costs, and improved security compare…
-
Majestic Labs unveils Prometheus server with 128TB memory
AI startup Majestic Labs is developing a new server called Prometheus, designed to overcome the limitations of current AI hardware by significantly increasing memory capacity. The server will feature up to 128 terabytes…
-
DeepSeek V4 Flash achieves 1M context on DGX Spark
A user has successfully configured DeepSeek V4 Flash on a DGX Spark system, achieving a maximum context window of 1 million tokens in the KV cache. Performance tests show consistent throughput across various context len…
-
New Kernels Ensure Deterministic LLM Inference Across Tensor Parallel Sizes
Researchers have developed Tree-Based Invariant Kernels (TBIK) to ensure deterministic inference in large language models, regardless of tensor parallel (TP) size. This addresses a critical issue where identical inputs …
-
DriftSched improves LLM inference efficiency with adaptive scheduling
Researchers have developed DriftSched, a framework to improve the efficiency of multi-tenant GPU inference for large language models. This system addresses the challenge of runtime token drift, where actual output lengt…
-
Run LLMs Locally for Private Code Debugging
Developers can now run powerful open-source LLMs locally for code debugging and review, bypassing privacy concerns and API costs associated with cloud-based services like ChatGPT. Tools such as Ollama and LM Studio simp…
-
Mini PC User Questions AI Performance of MINISFORUM UM790 Pro
A user on the r/LocalLLaMA subreddit is inquiring about the performance of the MINISFORUM UM790 Pro mini PC for running AI models with inference engines like llama.cpp and vLLM. They reference a claim that this $351 device is a notable option…
-
Anthropic's Claude Opus 4.8 offers incremental gains, platform updates
Anthropic has released Claude Opus 4.8, which offers incremental improvements over previous versions rather than a significant benchmark leap. While some users report minor gains in specific tasks like document parsing …
-
MTP boosts Gemma 4 and Qwen 3.6 inference speed by up to 3.34x
A user benchmarked Multi-Token Prediction (MTP) on Gemma 4 31B and Qwen 3.6 27B models using vLLM and llama.cpp, achieving up to a 3.34x inference speedup. The tests, conducted on an RTX 6000 PRO GPU, revealed that vLLM…
-
vLLM releases 0.22.1rc0 with faster test failure detection
vLLM has released version 0.22.1rc0, which includes improvements to its CI testing. Specifically, the release aims to make Model Executor test hangs fail faster by providing a traceback. This update is part of the ongoi…
-
User seeks $150K local inference server advice
A user on Reddit is seeking advice on building a local inference server with a budget of $150,000. Their current production server uses four H100 GPUs, and they are looking for a comparable or better alternative, consid…
-
H Company releases Holo-3.1-4B vision-language model
H Company has released Holo-3.1-4B, a new vision-language model designed for computer use agents. This model expands capabilities beyond desktop automation to include mobile environments and offers native function-callin…
-
vLLM adds HIP W4A16 kernel, boosting ROCm performance
The vLLM project has merged a pull request that introduces a native HIP W4A16 kernel, significantly boosting performance on ROCm-enabled hardware. This update shows substantial speed increases, with one configuration ac…
-
StepFun releases 198B sparse MoE vision-language model
StepFun has released Step-3.7-Flash, a 198-billion-parameter sparse Mixture-of-Experts model. This new vision-language model has day-zero support in the vLLM inference engine. The integration with vLLM is highl…
-
Unsloth releases optimized Gemma 4 models for local use
Unsloth has released several quantized versions of the Gemma 4 model, optimized for efficient local execution. These models, including `gemma-4-12B-it-qat-GGUF` and `gemma-4-12b-it-GGUF`, are available on Hugging Face. …
-
Nexus Labs cuts costs by serving 40 LoRA adapters on one Llama 3.1 model
Nexus Labs has developed a cost-effective method for serving multiple LoRA adapters on a single base model, significantly reducing infrastructure expenses. By utilizing vLLM's multi-LoRA serving capability, they consoli…
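The summary above is truncated and does not describe how Nexus Labs configured its deployment, so the following is only a toy sketch of the general idea behind multi-LoRA serving: one shared base weight matrix plus many small low-rank adapters, selected per request. It is not vLLM code. The class, the function names, the tenant names, and the layer dimensions (4096 x 4096, rank 16) are hypothetical; only the adapter count of 40 comes from the headline.

```python
# Toy illustration of multi-LoRA serving (NOT vLLM's implementation).
# All names and sizes are hypothetical and chosen for illustration only.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class MultiLoRALayer:
    """One shared base weight with per-request low-rank adapters."""

    def __init__(self, base_weight):
        self.base_weight = base_weight  # shape: d_out x d_in, loaded once
        self.adapters = {}              # adapter name -> (A, B)

    def register(self, name, a, b):
        # a has shape r x d_in, b has shape d_out x r, with r much smaller
        # than d_in and d_out in a realistic setting.
        self.adapters[name] = (a, b)

    def forward(self, x, adapter=None):
        # Base path is identical for every request.
        y = matvec(self.base_weight, x)
        if adapter is not None:
            a, b = self.adapters[adapter]
            # Low-rank correction: B @ (A @ x)
            delta = matvec(b, matvec(a, x))
            y = [yi + di for yi, di in zip(y, delta)]
        return y


def param_counts(d_in, d_out, rank, n_adapters):
    """Compare parameters for one shared base vs. one full copy per adapter."""
    base = d_in * d_out
    per_adapter = rank * (d_in + d_out)
    shared = base + n_adapters * per_adapter
    dedicated = n_adapters * (base + per_adapter)
    return shared, dedicated


# Two hypothetical tenants sharing one 2x2 base layer.
layer = MultiLoRALayer([[1.0, 0.0], [0.0, 1.0]])
layer.register("tenant-a", a=[[1.0, 1.0]], b=[[0.5], [0.0]])
layer.register("tenant-b", a=[[1.0, -1.0]], b=[[0.0], [2.0]])

x = [2.0, 3.0]
base_out = layer.forward(x)
a_out = layer.forward(x, "tenant-a")
b_out = layer.forward(x, "tenant-b")

# Hypothetical single layer: 4096 x 4096 weights, rank-16 adapters, 40 adapters.
shared, dedicated = param_counts(4096, 4096, 16, 40)
print(base_out, a_out, b_out)
print(shared, dedicated)
```

Under these assumed sizes, the shared layout needs a small fraction of the parameters that 40 dedicated copies would, which is the kind of consolidation the headline points to. Actual savings depend on the model, the adapter rank, and the serving engine.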