SGLang
PulseAugur coverage of SGLang — every cluster mentioning SGLang across labs, papers, and developer communities, ranked by signal.
2 days with sentiment data
-
AMD invests $3.6M in AI dev clusters to boost ROCm ecosystem
AMD is making significant efforts to support the open-source AI community, particularly with its ROCm software stack. The company has recently provided access to interconnected MI355X development clusters, valued at $3.…
-
Thinking Machines previews real-time interaction models; OpenAI launches deployment unit
Thinking Machines has previewed new "interaction models" designed for real-time, continuous human-AI collaboration, moving beyond traditional turn-based systems. OpenAI is expanding its enterprise focus with the launch …
-
New techniques boost small LLM Bash generation and speed up AI inference
Researchers have developed a technique called grammar-constrained decoding to improve the Bash command generation capabilities of small language models. This method enhances accuracy and safety, transforming natural lan…
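The core idea of grammar-constrained decoding can be sketched in a few lines (this is an illustrative toy, not the paper's implementation): at every step, tokens that would violate the grammar are masked out, so the model can only emit syntactically valid commands. The toy grammar and scoring dict below are assumptions for illustration.

```python
# Toy "grammar": a command must be VERB, then zero or more FLAGs, then a PATH.
VERBS = {"ls", "rm", "cp"}
FLAGS = {"-l", "-r", "-f"}
PATHS = {"/tmp", "/home"}

def allowed_tokens(prefix):
    """Return the set of tokens the grammar permits next."""
    if not prefix:
        return VERBS
    # after the verb: unused flags, or a path (which terminates the command)
    return (FLAGS - set(prefix)) | PATHS

def constrained_decode(score, max_len=4):
    """Greedy decode: pick the highest-scoring *grammar-legal* token each step."""
    out = []
    for _ in range(max_len):
        legal = allowed_tokens(out)
        tok = max(legal, key=score)   # masking = restricting argmax to the legal set
        out.append(tok)
        if tok in PATHS:              # a path ends the command
            break
    return out
```

In a real engine the mask is applied to the model's logits before sampling; here `score` stands in for the model's per-token preference.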
-
Anthropic boosts Claude Opus API limits; Google's Gemma 4 speeds inference; GPT-5.5 Instant now ChatGPT default
Anthropic has increased API limits for its Claude Opus model, aiming to reduce throttling for demanding workloads like agentic tasks, coding, and batch processing. Google is advancing speculative decoding with its Gemma…
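The speculative-decoding loop behind drafter models like Gemma can be sketched as follows (a minimal greedy variant, not Google's implementation): the small drafter proposes k tokens cheaply, the large target model checks them, and the longest agreed prefix is accepted, plus the target's correction at the first disagreement.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One round of greedy speculative decoding; always accepts >= 1 token."""
    # 1) drafter proposes k tokens autoregressively (cheap)
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # 2) target verifies the proposals (in a real engine: one batched forward pass)
    accepted = []
    ctx = list(prefix)
    for t in proposal:
        want = target_next(ctx)
        if want != t:
            accepted.append(want)  # replace the first disagreement with the target's token
            break
        accepted.append(t)
        ctx.append(t)
    return accepted
```

The speedup comes from step 2 being a single parallel verification pass rather than k sequential target-model calls.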
-
Modal boosts multimodal inference performance over 10% with Python dict
Modal has identified a performance bottleneck in multimodal inference engines like SGLang, which can hinder GPU utilization. By profiling the scheduler, they discovered that expensive bookkeeping for shared GPU memory c…
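The shape of the reported fix can be illustrated with a hypothetical sketch (names are invented, not Modal's or SGLang's code): expensive bookkeeping in the scheduler's hot loop is replaced by an O(1) lookup in a plain Python dict, so the GPU is not left idle while the CPU recomputes state for payloads it has already seen.

```python
import hashlib

_block_cache = {}  # content hash -> previously computed metadata

def expensive_bookkeeping(data: bytes) -> dict:
    # stand-in for a costly scan over shared GPU memory blocks
    return {"digest": hashlib.sha256(data).hexdigest(), "size": len(data)}

def lookup_blocks(data: bytes) -> dict:
    """Dict-backed fast path: compute once per unique payload, reuse afterwards."""
    key = hashlib.sha256(data).hexdigest()
    if key not in _block_cache:
        _block_cache[key] = expensive_bookkeeping(data)
    return _block_cache[key]
```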
-
MI355X on SGLang boosts DeepSeek v4 Pro throughput over 10x per GPU
DeepSeek v4 Pro has seen a significant performance increase, achieving over tenfold improvement in throughput per GPU. This advancement was realized through MI355X support in the SGLang framework. The gains re…
-
Aurora system unifies RL training and serving for faster LLM inference
Researchers have developed Aurora, a novel system that unifies the training and serving of speculative decoding for large language models. This approach addresses the delays and performance degradation associated with t…
-
Moore Threads completes full-link engineering adaptation for DeepSeek-V4
Moore Threads has successfully adapted the DeepSeek-V4 large language model to run on its flagship AI training and inference accelerator card, the MTT S5000. This integration was achieved using the company's proprietary…
-
EVICT method speeds up MoE speculative decoding by optimizing verification
Researchers have developed EVICT, a new method to improve the efficiency of speculative decoding for Mixture-of-Experts (MoE) models. This technique adaptively truncates the draft tree during verification, focusing on c…
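The truncation idea can be sketched abstractly (an illustrative interpretation of the summary above, not the authors' code): before the MoE target model verifies a speculative draft tree, low-confidence branches are pruned, so expert routing is only paid for candidates that are likely to be accepted.

```python
def truncate_tree(node, threshold):
    """node = (token, confidence, children). Keep only subtrees rooted at
    children whose draft confidence meets the threshold."""
    token, conf, children = node
    kept = [truncate_tree(c, threshold) for c in children if c[1] >= threshold]
    return (token, conf, kept)
```

A real system would pick the threshold adaptively per step; a fixed value here keeps the sketch minimal.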
-
Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
Researchers have developed UniPrefill, a novel framework designed to accelerate the prefill stage of long-context language models. Unlike previous methods that primarily benefit full-attention models, UniPrefill works a…
-
SGLang AI inference server hit with critical CVE-2026-5760 vulnerability
A critical security vulnerability (CVE-2026-5760) with a severity score of 9.8 has been identified in SGLang, an AI inference server. The issue arises from a poisoned GGUF model file containing a chat-template that SGLa…
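A generic mitigation for this class of bug can be sketched as follows (an illustrative check, not SGLang's actual patch): treat the chat template embedded in a model file as untrusted input, and refuse to render templates that reach for interpreter internals, which is the classic route from template injection to code execution.

```python
import re

# dunder access ("__class__", "__globals__", ...) is the usual sandbox
# escape hatch in template injection; also reject obvious shell-out names
_FORBIDDEN = re.compile(r"__\w+__|\bos\b|\bsubprocess\b|\beval\b|\bexec\b")

def check_chat_template(template: str) -> bool:
    """Return True only if the untrusted template looks safe to render."""
    return _FORBIDDEN.search(template) is None
```

Pattern-based screening is a coarse first line of defense; rendering untrusted templates inside a sandboxed environment is the more robust complement.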
-
Intel releases AutoRound toolkit for efficient LLM quantization
Intel has released AutoRound, an advanced toolkit for quantizing Large Language Models (LLMs) and Vision-Language Models (VLMs). This toolkit enables high accuracy at very low bit widths, specifically 2-4 bits, with min…
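For context, the round-to-nearest baseline that learned-rounding toolkits like AutoRound improve on is tiny (this is a generic sketch, not Intel's API): scale weights to the signed integer range for the target bit width, round, clamp, and dequantize.

```python
def quantize_rtn(weights, bits=4):
    """Symmetric per-tensor round-to-nearest: w -> round(w / scale) -> w' = q * scale."""
    qmax = 2 ** (bits - 1) - 1                          # e.g. 7 for signed 4-bit
    scale = (max(abs(w) for w in weights) / qmax) or 1.0  # guard all-zero input
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [v * scale for v in q], scale
```

Learned-rounding methods replace the fixed `round(...)` with a per-weight rounding offset tuned to minimize output error, which is what makes 2-4 bit widths viable.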
-
AI models see tool-calling improvements and bug fixes
A new tool has been developed that addresses a need identified by Andrej Karpathy, with its creation reportedly taking only 48 hours. Separately, a bug affecting DeepSeek V4's output in the SGLang open-source inference …
-
GLM 5.1 achieves 40 tokens/sec locally on RTX 6000 Pro cards
A user on the r/LocalLLaMA subreddit has successfully optimized the GLM 5.1 model for local deployment, achieving impressive performance metrics. By applying specific patches to the sglang inference software and utilizi…
-
MiniMax 2.7: SOTA open model matching GLM-5 at one-third the cost
MiniMax has released MiniMax 2.7, an open-source model that matches the performance of Z.ai's GLM-5 on several benchmarks but at a significantly lower cost. The model is noted for its efficiency and claims to be the fir…
-
DeepSeek v3 leads open-weight models, Baseten enables mission-critical inference
DeepSeek v3, a new 671B parameter Mixture-of-Experts model, has been released and is currently the top-performing open-weights model available. Serving such large models presents significant challenges, but inference st…