vLLM
PulseAugur coverage of vLLM — every cluster mentioning vLLM across labs, papers, and developer communities, ranked by signal.
- used by graphics processing unit 90%
- used by H.1000 Gnome 80%
- used by llama-cpp-python 70%
- used by Fp8 70%
- used by Horizon 2020 70%
- uses Anyscale, Inc. 70%
- competes with Text Generation Inference 60%
- used by Mlx 60%
- uses LM Studio 60%
- affiliated with Anyscale, Inc. 50%
- affiliated with LM Studio 50%
- affiliated with llama-cpp-python 50%
- 2026-05-15 product_launch vLLM released version 0.21.1rc0.
15 days with sentiment data
-
Modal boosts multimodal inference performance by over 10% with a Python dict
Modal has identified a performance bottleneck in multimodal inference engines like SGLang, which can hinder GPU utilization. By profiling the scheduler, they discovered that expensive bookkeeping for shared GPU memory c…
-
Developer builds mini vLLM from scratch, detailing PagedInfer and optimization techniques
A technical blog post details the creation of a custom inference engine for large language models, named PagedInfer. The author outlines a five-notebook process that starts with a basic transformer model and progresses …
-
Nvidia's GB300 GPU shows 2.7x faster inference than GB200
Nvidia's GB300 ultra NVL72 has demonstrated a 2.7x speed advantage over the GB200 NVL72 in inference tasks using the vLLM project's engine. This performance leap exceeds theoretical expectations based on the GB300's spe…
-
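HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
Researchers are developing several novel methods to optimize the key-value (KV) cache in large language models, a major bottleneck for long-context processing. These methods include training models to natively produce compressible representations (KV-CAT), manipulating the latent attention space for efficient steering (Memory Inception), and applying advanced quantization techniques such as int4 and spectral denoising (eOptShrinkQ, HeadQ). In addition, new strategies such as WindowQuant for multimodal models and tierKV for distributed KV-cache management aim to reduce latency and memory usage, with tierKV even…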
-
Utilyze offers open-source tool for deeper GPU performance insights beyond load
Utilyze is a new open-source tool designed to provide deeper insights into GPU performance beyond simple load percentages. It directly accesses GPU performance counters to measure the actual utilization and efficiency o…
-
Google's Gemma 4 models achieve 3x speed boost with speculative decoding
Google has released Multi-Token Prediction (MTP) drafters for its Gemma 4 open models, which can increase inference speed by up to three times. This advancement utilizes a speculative decoding architecture, allowing a l…
-
vLLM releases v0.20.2rc0 with new shutdown method
vLLM has released version 0.20.2rc0, introducing a new shutdown() method. This update is part of the ongoing development of the vLLM project, which focuses on efficient LLM inference.
-
NVIDIA NeMo RL uses speculative decoding for 1.8x faster AI training
NVIDIA Research has integrated speculative decoding into its NeMo RL framework, resulting in a 1.8x speedup for rollout generation at an 8 billion parameter scale. This advancement, utilizing a vLLM backend, is projecte…
-
FluxMoE system decouples expert weights for faster LLM serving
Researchers have developed FluxMoE, a new system designed to improve the efficiency of serving Mixture-of-Experts (MoE) models. FluxMoE addresses the challenge of large parameter sizes in MoE models by decoupling expert…
-
Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
Researchers have developed UniPrefill, a novel framework designed to accelerate the prefill stage of long-context language models. Unlike previous methods that primarily benefit full-attention models, UniPrefill works a…
-
New research details speculative decoding for faster RL post-training rollouts
Researchers have developed a system-integrated speculative decoding method to accelerate the post-training rollout generation for large language models. This technique, implemented within NeMo-RL with a vLLM backend, ac…
-
SGLang AI inference server hit with critical CVE-2026-5760 vulnerability
A critical security vulnerability (CVE-2026-5760) with a severity score of 9.8 has been identified in SGLang, an AI inference server. The issue arises from a poisoned GGUF model file containing a chat-template that SGLa…
-
Stateful Transformers boost streaming inference; Intel releases AutoRound quantization toolkit
A new paper introduces a stateful transformer inference engine that significantly speeds up processing for streaming data by maintaining a persistent KV cache. This approach allows for query latency that is independent …
-
User sets up vLLM for parallel LLM inference experiments
The user is setting up vLLM to conduct experiments with parallel inference for large language models. The goal is to have a single model generate multiple solutions for tasks, such as coding functions or tests, which ca…
-
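Numind releases NuExtract3 for document understanding
Numind has released NuExtract3, a 4-billion-parameter vision-language model designed for document understanding. The model excels at structured information extraction and at converting images to Markdown, which makes it useful for OCR, RAG preprocessing, and handling a wide range of document types. NuExtract3 supports multimodal input and multilingual documents, offers both reasoning and non-reasoning inference modes, and is already available in several quantized formats.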
-
Hugging Face hosts fine-tuned Qwen 3.6 models
Hugging Face hosts two fine-tuned Qwen 3.6 models, one with 40 billion parameters and the other with 27 billion. The two models are named 'DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-MAX-GGUF' and 'DavidAU/Qwen3.6-27B-Heretic-Uncensored-F…
-
IBM Power systems now support vLLM for AI model deployment
IBM's community blog details how to set up and run vLLM, an open-source library for fast LLM inference, on IBM Power systems. The guide aims to enable efficient deployment of large language models on this specific hardw…
-
IBM Research integrates vLLM into its RITS Platform for AI development
IBM Research has integrated vLLM, an open-source library for fast LLM inference, into its RITS Platform. This integration aims to enhance the platform's capabilities by leveraging vLLM's efficient processing for large l…
-
New research explores LLM security, efficiency, and training optimization
Researchers are developing novel methods to enhance the efficiency and security of Large Language Models (LLMs). One approach, "Widening the Gap," exploits outlier injection to compromise LLM quantization, demonstrating…
-
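Fireworks AI releases DeepSeek V4 Pro after fixing critical bugs
Fireworks AI has released DeepSeek V4 Pro, an open-source model with significant advances in long-context reasoning, agentic performance, and inference efficiency. The model uses a mixture-of-experts architecture and a 1M-token context window, and is designed to handle extensive state and complex agentic workflows cost-effectively. Fireworks AI delayed the public release to resolve critical serving-path correctness issues that caused degraded reasoning and corrupted output, ensuring production readiness before launch.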