SGLang
PulseAugur coverage of SGLang — every cluster mentioning SGLang across labs, papers, and developer communities, ranked by signal.
- 2026-01-09 product_launch SGLang released version 0.3.1 of its model gateway, featuring performance and memory improvements.
-
SGLang on MI355X boosts DeepSeek V4 Pro throughput over 10x per GPU
DeepSeek V4 Pro has seen a more than tenfold improvement in throughput per GPU. This was achieved through MI355X integration in the SGLang framework. The gains re…
-
Aurora system unifies RL training and serving for faster LLM inference
Researchers have developed Aurora, a novel system that unifies the training and serving of speculative decoding for large language models. This approach addresses the delays and performance degradation associated with t…
-
Gemma 4 QAT models spark debate over performance and utility
Users are discussing the performance and utility of Gemma 4 QAT (Quantization Aware Training) models, particularly comparing them to standard quantizations. While some users report improved speed and quality for general…
-
Moore Threads completes full-link engineering adaptation for DeepSeek-V4
Moore Threads has successfully adapted the DeepSeek-V4 large language model to run on its flagship AI training and inference accelerator card, the MTT S5000. This integration was achieved using the company's proprietary…
-
EVICT method speeds up MoE speculative decoding by optimizing verification
Researchers have developed EVICT, a new method to improve the efficiency of speculative decoding for Mixture-of-Experts (MoE) models. This technique adaptively truncates the draft tree during verification, focusing on c…
-
Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
Researchers have developed UniPrefill, a novel framework designed to accelerate the prefill stage of long-context language models. Unlike previous methods that primarily benefit full-attention models, UniPrefill works a…
-
SGLang AI inference server hit with critical CVE-2026-5760 vulnerability
A critical security vulnerability (CVE-2026-5760) with a severity score of 9.8 has been identified in SGLang, an AI inference server. The issue arises from a poisoned GGUF model file containing a chat-template that SGLa…
-
Stateful Transformers boost streaming inference; Intel releases AutoRound quantization toolkit
A new paper introduces a stateful transformer inference engine that significantly speeds up processing for streaming data by maintaining a persistent KV cache. This approach allows for query latency that is independent …
-
AI models see tool-calling improvements and bug fixes
A new tool has been developed that addresses a need identified by Andrej Karpathy, with its creation reportedly taking only 48 hours. Separately, a bug affecting DeepSeek V4's output in the SGLang open-source inference …
-
New research explores LLM security, efficiency, and training optimization
Researchers are developing novel methods to enhance the efficiency and security of Large Language Models (LLMs). One approach, "Widening the Gap," exploits outlier injection to compromise LLM quantization, demonstrating…
-
Fireworks AI releases DeepSeek V4 Pro after fixing critical bugs
Fireworks AI has released DeepSeek V4 Pro, an open-source model notable for its advancements in long-context reasoning, agentic performance, and inference efficiency. The model features a mixture-of-experts architecture…
-
GLM 5.1 achieves 40 tokens/sec locally on RTX 6000 Pro cards
A user on the r/LocalLLaMA subreddit has successfully optimized the GLM 5.1 model for local deployment, achieving impressive performance metrics. By applying specific patches to the SGLang inference software and utilizi…
-
Moonshot AI releases Kimi K2.6 multimodal agentic model
Moonshot AI has released Kimi K2.6, an open-source multimodal model designed for advanced agentic tasks. This model demonstrates significant improvements in long-horizon coding across multiple languages and domains. Kim…
-
Qwen releases 27B multimodal model for advanced coding
Qwen has released Qwen3.6-27B, a dense 27-billion-parameter multimodal model designed for advanced coding tasks. This model aims to provide flagship-level agentic coding performance, surpassing previous open-source mode…
-
SGLang boosts model gateway performance with cache-aware routing
SGLang has released version 0.3.1 of its model gateway, significantly boosting performance and reducing memory usage. The update introduces cache-aware routing that is 10-12x faster and uses 99% less memory, enabling 10…
-
NVIDIA Nemotron Diffusion models offer 6.4x faster AI inference
NVIDIA has released the Nemotron-Labs Diffusion family of language models, available in 3B, 8B, and 14B parameter sizes. These models uniquely support autoregressive (AR), diffusion, and self-speculation decoding modes …
-
MiniMax 2.7: SOTA open model matching GLM-5 at 1/3 the cost
MiniMax has released MiniMax 2.7, an open-source model that matches the performance of Z.ai's GLM-5 on several benchmarks but at a significantly lower cost. The model is noted for its efficiency and claims to be the fir…
-
DeepSeek v3 leads open-weight models, Baseten enables mission-critical inference
DeepSeek v3, a new 671B parameter Mixture-of-Experts model, has been released and is currently the top-performing open-weights model available. Serving such large models presents significant challenges, but inference st…