ENTITY SGLang

SGLang

PulseAugur coverage of SGLang — every cluster mentioning SGLang across labs, papers, and developer communities, ranked by signal.

Total · 30d

58 over 90d

Releases · 30d

0 over 90d

Papers · 30d

16 over 90d

TIER MIX · 90D

frontier release 9
significant 4
research 12
tool 30
commentary 2
meme 1

TOPICS

product 33
infra 32
model release 30
paper 16
other 3
safety 2
funding 1

RELATIONSHIPS

used by vLLM 70%
used by transformers 70%
used by graphics processing unit 70%
used by Ollama 70%
used by llama-cpp-python 60%
affiliated with vLLM 50%
affiliated with transformers 50%
competes with vLLM 50%
used by llama.cpp 50%
used by Raspberry Pi 50%

TIMELINE

2026-01-09 product_launch SGLang released version 0.3.1 of its model gateway, featuring performance and memory improvements.

SENTIMENT · 30D

19 days with sentiment data

RECENT · PAGE 2/3 · 58 TOTAL

RESEARCH · CL_64767 · May 26 · 09:09

JetBrains releases Mellum2 reasoning model with 131K context

JetBrains has released its Mellum2 model family, including the Mellum2-12B-A2.5B-Thinking variant, which is designed for complex reasoning tasks. This model utilizes a Mixture-of-Experts architecture with a large contex…
TOOL · CL_50813 · May 26 · 04:00

New method speeds up RLHF training with adaptive parallelism

Researchers have developed a new method called PAT to accelerate the training of Reinforcement Learning from Human Feedback (RLHF) models. This technique dynamically adjusts tensor parallelism during the generation stag…
FRONTIER RELEASE · CL_57657 · May 24 · 22:16

Liquid AI ships LFM2.5-8B-A1B on-device MoE model

Liquid AI has released LFM2.5-8B-A1B, a new on-device Mixture-of-Experts (MoE) model designed for complex tasks and tool chaining. This model features 8.3 billion total parameters but activates only 1.5 billion per toke…
FRONTIER RELEASE · CL_58091 · May 23 · 02:13

Stepfun AI releases 198B parameter multimodal MoE model

Stepfun AI has released Step 3.7 Flash, a 198-billion parameter sparse Mixture-of-Experts (MoE) vision-language model. This model is optimized for agentic workflows, coding, and multimodal tasks, activating approximatel…
TOOL · CL_44370 · May 22 · 16:01

Modal achieves serverless GPUs for AI inference in seconds

Modal has developed a system to achieve truly serverless GPUs for AI inference, addressing the challenge of rapidly scaling resources to meet variable demand. Their approach involves maintaining cloud buffers of idle GP…
RESEARCH · CL_48751 · May 22 · 00:00

LLMs and new frameworks boost GPU kernel optimization

Researchers are exploring novel ways to optimize GPU kernel performance for large language models. One approach uses language models as surrogates to predict kernel performance, significantly increasing the number of ca…
SIGNIFICANT · CL_49676 · May 21 · 07:27

OpenBMB releases MiniCPM5-1B for on-device AI tasks

OpenBMB has released MiniCPM5-1B, a 1-billion parameter Transformer model designed for on-device and resource-constrained environments. This model claims state-of-the-art performance within its size class, particularly …
TOOL · CL_69323 · May 21 · 04:15

Hugging Face releases Qwen/Qwen-Image-Bench multimodal model

Hugging Face has released Qwen/Qwen-Image-Bench, a new multimodal model capable of processing both text and images. The model is accessible through various libraries and tools, including Transformers, vLLM, and SGLang. …
RESEARCH · CL_47600 · May 21 · 00:00

AI cloud platform Modal raises $355M at $4.65B valuation

Modal has secured $355 million in Series C funding, valuing the company at $4.65 billion post-money. The company has experienced significant growth, with annualized revenue surpassing $300 million and a fivefold increas…
COMMENTARY · CL_41324 · May 20 · 19:41

Google Spark vs. OpenClaw: AI debate centers on workflow control, not model smarts

A Reddit discussion reveals that the competition between Google Spark and OpenClaw is not about which AI model is smarter, but rather about control over user workflows. Google Spark leverages its ecosystem of cloud serv…
TOOL · CL_42512 · May 20 · 15:51

New method speeds up triangular inversion for linear transformers

Researchers have developed a new method for triangular inversion, a crucial operation in linear attention mechanisms used by advanced models like Qwen3.5/3.6 and Kimi Linear. This technique significantly improves the sp…
TOOL · CL_40951 · May 20 · 11:37

vLLM production guide details key config decisions for performance

This article provides a guide for optimizing vLLM deployments, focusing on three critical configuration decisions that impact performance and cost. It details how static KV cache allocation can lead to GPU out-of-memory…
TOOL · CL_39129 · May 19 · 13:28

SGLang's Radix Cache explained via LeetCode problems

The Radix Cache, a key component in SGLang's high-throughput LLM processing, optimizes performance by reusing computed KV cache prefixes across requests. This is achieved by storing these prefixes in a Radix Tree, simil…
FRONTIER RELEASE · CL_71083 · May 15 · 21:52

NVIDIA releases Nemotron-3 Ultra 550B LLM for advanced reasoning

NVIDIA has released its Nemotron-3 Ultra 550B model, a large language model designed for advanced reasoning and agentic workflows. This model features a hybrid LatentMoE architecture with Mamba-2 and attention layers, s…
TOOL · CL_33818 · May 15 · 21:31

PyTorch tutorial simplifies distributed AI model inference

This article explains distributed inference techniques for large AI models using PyTorch. It details how to implement Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) with minimal code. The …
RESEARCH · CL_31391 · May 14 · 09:51

Moore Threads rallies open-source AI dev community for MUSA GPU ecosystem

Chinese GPU maker Moore Threads has convened a meetup focused on integrating its MUSA architecture with key open-source large model inference frameworks like SGLang. The event brought together core developers from proje…
SIGNIFICANT · CL_29336 · May 13 · 01:42

AMD invests $3.6M in AI dev clusters to boost ROCm ecosystem

AMD is making significant efforts to support the open-source AI community, particularly with its ROCm software stack. The company has recently provided access to interconnected MI355X development clusters, valued at $3.…
TOOL · CL_82391 · May 12 · 16:09

Hugging Face releases Harness-1, a 20B search agent model

A new 20-billion parameter search agent model named Harness-1 has been released on Hugging Face. This model is designed to match the search capabilities of frontier AI systems and is based on the openai/gpt-oss-20b mode…
RESEARCH · CL_23335 · May 8 · 17:37

New techniques boost small LLM Bash generation and speed up AI inference

Researchers have developed a technique called grammar-constrained decoding to improve the Bash command generation capabilities of small language models. This method enhances accuracy and safety, transforming natural lan…
RESEARCH · CL_23761 · May 6 · 17:45

Modal boosts multimodal inference performance over 10% with Python dict

Modal has identified a performance bottleneck in multimodal inference engines like SGLang, which can hinder GPU utilization. By profiling the scheduler, they discovered that expensive bookkeeping for shared GPU memory c…
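
As a rough illustration of the prefix reuse described in the "SGLang's Radix Cache explained via LeetCode problems" item above, here is a minimal sketch in Python. It is not SGLang's implementation: it uses a plain token-level trie (an uncompressed simplification of a radix tree), stores no KV tensors, and the names `TrieNode`, `PrefixCache`, `insert`, and `match_prefix` are hypothetical, chosen only for this example. It shows the one idea the summary states: a new request can skip recomputation for however many leading tokens it shares with an earlier request.

```python
from dataclasses import dataclass, field
from typing import Dict, Sequence


@dataclass
class TrieNode:
    # Child nodes keyed by the next token id.
    # A real cache would also hold a handle to the KV entries for this token;
    # this sketch (an assumption, not SGLang's layout) tracks structure only.
    children: Dict[int, "TrieNode"] = field(default_factory=dict)


class PrefixCache:
    """Toy prefix index: records token sequences and reports the longest cached prefix."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: Sequence[int]) -> None:
        """Record a processed token sequence so later requests can share its prefix."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def match_prefix(self, tokens: Sequence[int]) -> int:
        """Return how many leading tokens of `tokens` are already recorded."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


cache = PrefixCache()
system_prompt = [101, 7, 7, 42]          # made-up token ids for a shared prompt
cache.insert(system_prompt + [5, 6])     # first request: prompt plus its own suffix

# Second request shares only the system prompt, so 4 tokens could be reused.
reused = cache.match_prefix(system_prompt + [9, 9, 9])
print(reused)
```

A radix tree proper would merge chains of single-child nodes into one edge labeled with a token run, which reduces node count without changing the longest-prefix result shown here.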
