SGLang
PulseAugur coverage of SGLang — every cluster mentioning SGLang across labs, papers, and developer communities, ranked by signal.
2 days with sentiment data
-
AMD invests $3.6M in AI dev clusters to boost ROCm ecosystem
AMD is making significant efforts to support the open-source AI community, particularly with its ROCm software stack. The company has recently provided access to interconnected MI355X development clusters, valued at $3.…
-
Thinking Machines previews real-time interaction models; OpenAI launches deployment unit
Thinking Machines has previewed new "interaction models" designed for real-time, continuous human-AI collaboration, moving beyond traditional turn-based systems. OpenAI is expanding its enterprise focus with the launch …
-
New techniques boost small LLM Bash generation and speed up AI inference
Researchers have developed a technique called grammar-constrained decoding to improve the Bash command generation capabilities of small language models. This method enhances accuracy and safety, transforming natural lan…
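The core idea of grammar-constrained decoding can be sketched in a few lines (this is an illustrative toy, not the paper's implementation): at every step, tokens that would violate the grammar are masked out, so the model can only emit syntactically valid commands. The toy grammar and scoring dict below are assumptions for illustration.

```python
# Toy "grammar": a command must be VERB, then zero or more FLAGs, then a PATH.
VERBS = {"ls", "rm", "cp"}
FLAGS = {"-l", "-r", "-f"}
PATHS = {"/tmp", "/home"}

def allowed_tokens(prefix):
    """Return the set of tokens the grammar permits next."""
    if not prefix:
        return VERBS
    # after the verb: unused flags, or a path (which terminates the command)
    return (FLAGS - set(prefix)) | PATHS

def constrained_decode(score, max_len=4):
    """Greedy decode: pick the highest-scoring *grammar-legal* token each step."""
    out = []
    for _ in range(max_len):
        legal = allowed_tokens(out)
        tok = max(legal, key=score)   # masking = restricting argmax to the legal set
        out.append(tok)
        if tok in PATHS:              # a path ends the command
            break
    return out
```

In a real engine the mask is applied to the model's logits before sampling; here `score` stands in for the model's per-token preference.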
-
Anthropic boosts Claude Opus API limits; Google's Gemma 4 speeds inference; GPT-5.5 Instant now ChatGPT default
Anthropic has increased API limits for its Claude Opus model, aiming to reduce throttling for demanding workloads like agentic tasks, coding, and batch processing. Google is advancing speculative decoding with its Gemma…
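The speculative-decoding loop behind drafter models like Gemma can be sketched as follows (a minimal greedy variant, not Google's implementation): the small drafter proposes k tokens cheaply, the large target model checks them, and the longest agreed prefix is accepted, plus the target's correction at the first disagreement.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One round of greedy speculative decoding; always accepts >= 1 token."""
    # 1) drafter proposes k tokens autoregressively (cheap)
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # 2) target verifies the proposals (in a real engine: one batched forward pass)
    accepted = []
    ctx = list(prefix)
    for t in proposal:
        want = target_next(ctx)
        if want != t:
            accepted.append(want)  # replace the first disagreement with the target's token
            break
        accepted.append(t)
        ctx.append(t)
    return accepted
```

The speedup comes from step 2 being a single parallel verification pass rather than k sequential target-model calls.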
-
Modal boosts multimodal inference performance over 10% with Python dict
Modal has identified a performance bottleneck in multimodal inference engines like SGLang, which can hinder GPU utilization. By profiling the scheduler, they discovered that expensive bookkeeping for shared GPU memory c…
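The shape of the reported fix can be illustrated with a hypothetical sketch (names are invented, not Modal's or SGLang's code): expensive bookkeeping in the scheduler's hot loop is replaced by an O(1) lookup in a plain Python dict, so the GPU is not left idle while the CPU recomputes state for payloads it has already seen.

```python
import hashlib

_block_cache = {}  # content hash -> previously computed metadata

def expensive_bookkeeping(data: bytes) -> dict:
    # stand-in for a costly scan over shared GPU memory blocks
    return {"digest": hashlib.sha256(data).hexdigest(), "size": len(data)}

def lookup_blocks(data: bytes) -> dict:
    """Dict-backed fast path: compute once per unique payload, reuse afterwards."""
    key = hashlib.sha256(data).hexdigest()
    if key not in _block_cache:
        _block_cache[key] = expensive_bookkeeping(data)
    return _block_cache[key]
```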
-
MI355X on SGLang boosts DeepSeek v4 Pro throughput over 10x per GPU
DeepSeek v4 Pro has seen a significant performance increase, achieving over tenfold improvement in throughput per GPU. This advancement was realized through MI355X support in the SGLang framework. The gains re…
-
Aurora system unifies RL training and serving for faster LLM inference
Researchers have developed Aurora, a novel system that unifies the training and serving of speculative decoding for large language models. This approach addresses the delays and performance degradation associated with t…
-
Moore Threads completes full-link engineering adaptation for DeepSeek-V4
Moore Threads has successfully adapted the DeepSeek-V4 large language model to run on its flagship AI training and inference accelerator card, the MTT S5000. This integration was achieved using the company's proprietary…
-
EVICT method speeds up MoE speculative decoding by optimizing verification
Researchers have developed EVICT, a new method to improve the efficiency of speculative decoding for Mixture-of-Experts (MoE) models. This technique adaptively truncates the draft tree during verification, focusing on c…
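The truncation idea can be sketched abstractly (an illustrative interpretation of the summary above, not the authors' code): before the MoE target model verifies a speculative draft tree, low-confidence branches are pruned, so expert routing is only paid for candidates that are likely to be accepted.

```python
def truncate_tree(node, threshold):
    """node = (token, confidence, children). Keep only subtrees rooted at
    children whose draft confidence meets the threshold."""
    token, conf, children = node
    kept = [truncate_tree(c, threshold) for c in children if c[1] >= threshold]
    return (token, conf, kept)
```

A real system would pick the threshold adaptively per step; a fixed value here keeps the sketch minimal.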
-
Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
Researchers have developed UniPrefill, a novel framework designed to accelerate the prefill stage of long-context language models. Unlike previous methods that primarily benefit full-attention models, UniPrefill works a…
-
SGLang AI inference server hit with critical CVE-2026-5760 vulnerability
A critical security vulnerability (CVE-2026-5760) with a severity score of 9.8 has been identified in SGLang, an AI inference server. The issue arises from a poisoned GGUF model file containing a chat-template that SGLa…
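A generic mitigation for this class of bug can be sketched as follows (an illustrative check, not SGLang's actual patch): treat the chat template embedded in a model file as untrusted input, and refuse to render templates that reach for interpreter internals, which is the classic route from template injection to code execution.

```python
import re

# dunder access ("__class__", "__globals__", ...) is the usual sandbox
# escape hatch in template injection; also reject obvious shell-out names
_FORBIDDEN = re.compile(r"__\w+__|\bos\b|\bsubprocess\b|\beval\b|\bexec\b")

def check_chat_template(template: str) -> bool:
    """Return True only if the untrusted template looks safe to render."""
    return _FORBIDDEN.search(template) is None
```

Pattern-based screening is a coarse first line of defense; rendering untrusted templates inside a sandboxed environment is the more robust complement.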
-
Intel releases AutoRound toolkit for efficient LLM quantization
Intel has released AutoRound, an advanced toolkit for quantizing Large Language Models (LLMs) and Vision-Language Models (VLMs). This toolkit enables high accuracy at very low bit widths, specifically 2-4 bits, with min…
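For context, the round-to-nearest baseline that learned-rounding toolkits like AutoRound improve on is tiny (this is a generic sketch, not Intel's API): scale weights to the signed integer range for the target bit width, round, clamp, and dequantize.

```python
def quantize_rtn(weights, bits=4):
    """Symmetric per-tensor round-to-nearest: w -> round(w / scale) -> w' = q * scale."""
    qmax = 2 ** (bits - 1) - 1                          # e.g. 7 for signed 4-bit
    scale = (max(abs(w) for w in weights) / qmax) or 1.0  # guard all-zero input
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [v * scale for v in q], scale
```

Learned-rounding methods replace the fixed `round(...)` with a per-weight rounding offset tuned to minimize output error, which is what makes 2-4 bit widths viable.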
-
AI models see tool-calling improvements and bug fixes
A new tool has been developed that addresses a need identified by Andrej Karpathy, with its creation reportedly taking only 48 hours. Separately, a bug affecting DeepSeek V4's output in the SGLang open-source inference …
-
GLM 5.1 achieves 40 tokens/sec locally on RTX 6000 Pro cards
A user on the r/LocalLLaMA subreddit has successfully optimized the GLM 5.1 model for local deployment, achieving impressive performance metrics. By applying specific patches to the sglang inference software and utilizi…
-
MiniMax 2.7: SOTA open model matching GLM-5 at one-third the cost
MiniMax has released MiniMax 2.7, an open-source model that matches the performance of Z.ai's GLM-5 on several benchmarks but at a significantly lower cost. The model is noted for its efficiency and claims to be the fir…
-
DeepSeek v3 leads open-weight models, Baseten enables mission-critical inference
DeepSeek v3, a new 671B parameter Mixture-of-Experts model, has been released and is currently the top-performing open-weights model available. Serving such large models presents significant challenges, but inference st…