SGLang
PulseAugur coverage of SGLang — every cluster mentioning SGLang across labs, papers, and developer communities, ranked by signal.
- 2026-01-09 product_launch SGLang released version 0.3.1 of its model gateway, featuring performance and memory improvements.
19 day(s) with sentiment data
-
JetBrains releases Mellum2 reasoning model with 131K context
JetBrains has released its Mellum2 model family, including the Mellum2-12B-A2.5B-Thinking variant, which is designed for complex reasoning tasks. This model utilizes a Mixture-of-Experts architecture with a large contex…
-
New method speeds up RLHF training with adaptive parallelism
Researchers have developed a new method called PAT to accelerate Reinforcement Learning from Human Feedback (RLHF) training. This technique dynamically adjusts tensor parallelism during the generation stag…
-
Liquid AI ships LFM2.5-8B-A1B on-device MoE model
Liquid AI has released LFM2.5-8B-A1B, a new on-device Mixture-of-Experts (MoE) model designed for complex tasks and tool chaining. This model features 8.3 billion total parameters but activates only 1.5 billion per toke…
-
Stepfun AI releases 198B parameter multimodal MoE model
Stepfun AI has released Step 3.7 Flash, a 198-billion parameter sparse Mixture-of-Experts (MoE) vision-language model. This model is optimized for agentic workflows, coding, and multimodal tasks, activating approximatel…
-
Modal achieves serverless GPUs for AI inference in seconds
Modal has developed a system to achieve truly serverless GPUs for AI inference, addressing the challenge of rapidly scaling resources to meet variable demand. Their approach involves maintaining cloud buffers of idle GP…
-
LLMs and new frameworks boost GPU kernel optimization
Researchers are exploring novel ways to optimize GPU kernel performance for large language models. One approach uses language models as surrogates to predict kernel performance, significantly increasing the number of ca…
-
OpenBMB releases MiniCPM5-1B for on-device AI tasks
OpenBMB has released MiniCPM5-1B, a 1-billion parameter Transformer model designed for on-device and resource-constrained environments. This model claims state-of-the-art performance within its size class, particularly …
-
Hugging Face releases Qwen/Qwen-Image-Bench multimodal model
Hugging Face has released Qwen/Qwen-Image-Bench, a new multimodal model capable of processing both text and images. The model is accessible through various libraries and tools, including Transformers, vLLM, and SGLang. …
-
AI cloud platform Modal raises $355M at $4.65B valuation
Modal has secured $355 million in Series C funding, valuing the company at $4.65 billion post-money. The company has experienced significant growth, with annualized revenue surpassing $300 million and a fivefold increas…
-
Google Spark vs. OpenClaw: AI debate centers on workflow control, not model smarts
A Reddit discussion reveals that the competition between Google Spark and OpenClaw is not about which AI model is smarter, but rather about control over user workflows. Google Spark leverages its ecosystem of cloud serv…
-
New method speeds up triangular inversion for linear transformers
Researchers have developed a new method for triangular inversion, a crucial operation in linear attention mechanisms used by advanced models like Qwen3.5/3.6 and Kimi Linear. This technique significantly improves the sp…
-
vLLM production guide details key config decisions for performance
This article provides a guide for optimizing vLLM deployments, focusing on three critical configuration decisions that impact performance and cost. It details how static KV cache allocation can lead to GPU out-of-memory…
-
SGLang's Radix Cache explained via LeetCode problems
The Radix Cache, a key component in SGLang's high-throughput LLM processing, optimizes performance by reusing computed KV cache prefixes across requests. This is achieved by storing these prefixes in a Radix Tree, simil…
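The prefix-reuse idea in this summary can be sketched with a toy token-level trie. This is only an illustration under stated assumptions: it is a plain (uncompressed) trie rather than a true radix tree, it tracks token ids only and stores no KV tensors, and the names `PrefixNode`, `PrefixCache`, `match_prefix`, and `insert` are hypothetical, not SGLang's API.

```python
from typing import Dict, List


class PrefixNode:
    """One token in the trie; children are keyed by the next token id."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}


class PrefixCache:
    """Toy prefix index (hypothetical, not SGLang's implementation)."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def match_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens of `tokens` are already recorded."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched

    def insert(self, tokens: List[int]) -> int:
        """Record a token sequence; return how many tokens were newly added."""
        node = self.root
        added = 0
        for tok in tokens:
            if tok not in node.children:
                node.children[tok] = PrefixNode()
                added += 1
            node = node.children[tok]
        return added


cache = PrefixCache()
system_prompt = [101, 7, 7, 42]        # hypothetical token ids for a shared prompt
cache.insert(system_prompt + [5, 6])   # first request populates the trie

# A second request with the same system prompt only needs work for its new suffix.
reused = cache.match_prefix(system_prompt + [9])
print(f"tokens reusable from cache: {reused}")
```

In a real serving engine the matched length would decide how much of the KV cache can be reused instead of recomputed; a radix tree additionally compresses runs of single-child nodes into one edge to save memory.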
-
NVIDIA releases Nemotron-3 Ultra 550B LLM for advanced reasoning
NVIDIA has released its Nemotron-3 Ultra 550B model, a large language model designed for advanced reasoning and agentic workflows. This model features a hybrid LatentMoE architecture with Mamba-2 and attention layers, s…
-
PyTorch tutorial simplifies distributed AI model inference
This article explains distributed inference techniques for large AI models using PyTorch. It details how to implement Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) with minimal code. The …
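As a rough illustration of the Tensor Parallelism idea named in this summary, the sketch below splits a weight matrix by columns, multiplies each shard separately, and concatenates the partial results. It is not the tutorial's code and deliberately avoids PyTorch so that it runs anywhere; the helper names `matmul`, `split_columns`, and `column_parallel_matmul` are hypothetical, and the "devices" are simulated by a plain loop.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix multiply: (n x k) @ (k x m) -> (n x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split a weight matrix into `parts` equal column shards."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def column_parallel_matmul(x: Matrix, w: Matrix, parts: int) -> Matrix:
    """Simulate column-wise tensor parallelism on a single process."""
    shards = split_columns(w, parts)
    # Each partial product would run on its own device in a real deployment.
    partials = [matmul(x, shard) for shard in shards]
    # Concatenating along the column axis stands in for an all-gather.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]

full = matmul(x, w)
sharded = column_parallel_matmul(x, w, parts=2)
print(full == sharded)
```

The check at the end shows that sharding changes where the work happens, not the result. Data Parallelism would instead replicate the full weights and split the input batch, and Pipeline Parallelism would place different layers on different devices.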
-
Moore Threads rallies open-source AI dev community for MUSA GPU ecosystem
Chinese GPU maker Moore Threads has convened a meetup focused on integrating its MUSA architecture with key open-source large model inference frameworks like SGLang. The event brought together core developers from proje…
-
AMD invests $3.6M in AI dev clusters to boost ROCm ecosystem
AMD is making significant efforts to support the open-source AI community, particularly with its ROCm software stack. The company has recently provided access to interconnected MI355X development clusters, valued at $3.…
-
Harness-1, a 20B search agent model, released on Hugging Face
A new 20-billion parameter search agent model named Harness-1 has been released on Hugging Face. This model is designed to match the search capabilities of frontier AI systems and is based on the openai/gpt-oss-20b mode…
-
New techniques boost small LLM Bash generation and speed up AI inference
Researchers have developed a technique called grammar-constrained decoding to improve the Bash command generation capabilities of small language models. This method enhances accuracy and safety, transforming natural lan…
-
Modal boosts multimodal inference performance by over 10% with a Python dict
Modal has identified a performance bottleneck in multimodal inference engines like SGLang, which can hinder GPU utilization. By profiling the scheduler, they discovered that expensive bookkeeping for shared GPU memory c…