PulseAugur

SGLang

PulseAugur coverage of SGLang: every cluster mentioning SGLang across labs, papers, and developer communities, ranked by signal.

Total · 30 days: 30 (90 days: 30)
Releases · 30 days: 0 (90 days: 0)
Papers · 30 days: 11 (90 days: 11)
Timeline
  1. 2026-01-09 product_launch SGLang released version 0.3.1 of its model gateway, featuring performance and memory improvements.
Sentiment · 30 days

9 days with sentiment data

Recent · Page 1 of 2 · 30 items total
  1. TOOL · CL_44370 ·

    Modal achieves serverless GPUs for AI inference in seconds

    Modal has built a system for truly serverless GPU inference, addressing the challenge of scaling resources quickly enough to meet variable demand. Their approach involves maintaining cloud buffers of idle GP…

  2. RESEARCH · CL_48751 ·

    New FastKernels benchmark targets GPU kernel generation for LLMs

    Researchers have introduced FastKernels, a new benchmark designed to better evaluate GPU kernel generation agents used in production LLM inference. Existing benchmarks are misaligned with real-world systems, leading age…

  3. SIGNIFICANT · CL_49676 ·

    OpenBMB releases MiniCPM5-1B for on-device AI tasks

    OpenBMB has released MiniCPM5-1B, a 1-billion parameter Transformer model designed for on-device and resource-constrained environments. This model claims state-of-the-art performance within its size class, particularly …

  4. RESEARCH · CL_47600 ·

    AI cloud platform Modal raises $355M at $4.65B valuation

    Modal has secured $355 million in Series C funding, valuing the company at $4.65 billion post-money. The company has experienced significant growth, with annualized revenue surpassing $300 million and a fivefold increas…

  5. COMMENTARY · CL_41324 ·

    Google Spark vs. OpenClaw: AI debate centers on workflow control, not model smarts

    A Reddit discussion reveals that the competition between Google Spark and OpenClaw is not about which AI model is smarter, but rather about control over user workflows. Google Spark leverages its ecosystem of cloud serv…

  6. TOOL · CL_42512 ·

    New method speeds up triangular inversion for linear transformers

    Researchers have developed a new method for triangular inversion, a crucial operation in linear attention mechanisms used by advanced models like Qwen3.5/3.6 and Kimi Linear. This technique significantly improves the sp…

  7. TOOL · CL_40951 ·

    vLLM production guide details key config decisions for performance

    This article provides a guide for optimizing vLLM deployments, focusing on three critical configuration decisions that impact performance and cost. It details how static KV cache allocation can lead to GPU out-of-memory…

  8. TOOL · CL_39129 ·

    SGLang's Radix Cache explained via LeetCode problems

    The Radix Cache, a key component in SGLang's high-throughput LLM processing, optimizes performance by reusing computed KV cache prefixes across requests. This is achieved by storing these prefixes in a Radix Tree, simil…

  9. TOOL · CL_33818 ·

    PyTorch tutorial simplifies distributed AI model inference

    This article explains distributed inference techniques for large AI models using PyTorch. It details how to implement Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) with minimal code. The …

  10. RESEARCH · CL_31391 ·

    Moore Threads rallies open-source AI dev community for MUSA GPU ecosystem

    Chinese GPU maker Moore Threads has convened a meetup focused on integrating its MUSA architecture with key open-source large model inference frameworks like SGLang. The event brought together core developers from proje…

  11. SIGNIFICANT · CL_29336 ·

    AMD invests $3.6M in AI dev clusters to boost ROCm ecosystem

    AMD is making significant efforts to support the open-source AI community, particularly with its ROCm software stack. The company has recently provided access to interconnected MI355X development clusters, valued at $3.…

  12. RESEARCH · CL_23335 ·

    New techniques boost small LLM Bash generation and speed up AI inference

    Researchers have developed a technique called grammar-constrained decoding to improve the Bash command generation capabilities of small language models. This method enhances accuracy and safety, transforming natural lan…

  13. RESEARCH · CL_23761 ·

    Modal boosts multimodal inference performance over 10% with Python dict

    Modal has identified a performance bottleneck in multimodal inference engines like SGLang, which can hinder GPU utilization. By profiling the scheduler, they discovered that expensive bookkeeping for shared GPU memory c…

  14. TOOL · CL_19382 ·

    SGLang on AMD MI355X boosts DeepSeek-V4 Pro throughput over 10x per GPU

    DeepSeek-V4 Pro has achieved a more than tenfold improvement in throughput per GPU. The gain came from running the model on MI355X GPUs with the SGLang framework. The gains re…

  15. TOOL · CL_16238 ·

    Aurora system unifies RL training and serving for faster LLM inference

    Researchers have developed Aurora, a novel system that unifies the training and serving of speculative decoding for large language models. This approach addresses the delays and performance degradation associated with t…

  16. RESEARCH · CL_11567 ·

    Moore Threads completes full-link engineering adaptation for DeepSeek-V4

    Moore Threads has successfully adapted the DeepSeek-V4 large language model to run on its flagship AI training and inference accelerator card, the MTT S5000. This integration was achieved using the company's proprietary…

  17. RESEARCH · CL_14133 ·

    EVICT method speeds up MoE speculative decoding by optimizing verification

    Researchers have developed EVICT, a new method to improve the efficiency of speculative decoding for Mixture-of-Experts (MoE) models. This technique adaptively truncates the draft tree during verification, focusing on c…

  18. RESEARCH · CL_10143 ·

    Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

    Researchers have developed UniPrefill, a novel framework designed to accelerate the prefill stage of long-context language models. Unlike previous methods that primarily benefit full-attention models, UniPrefill works a…

  19. RESEARCH · CL_09151 ·

    SGLang AI inference server hit with critical CVE-2026-5760 vulnerability

    A critical security vulnerability (CVE-2026-5760) with a severity score of 9.8 has been identified in SGLang, an AI inference server. The issue arises from a poisoned GGUF model file containing a chat-template that SGLa…

  20. RESEARCH · CL_09107 ·

    Stateful Transformers boost streaming inference; Intel releases AutoRound quantization toolkit

    A new paper introduces a stateful transformer inference engine that significantly speeds up processing for streaming data by maintaining a persistent KV cache. This approach allows for query latency that is independent …
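
Item 8 in the list above describes the Radix Cache as reusing computed KV cache prefixes across requests by storing them in a tree. The idea can be sketched roughly as follows. This is a minimal illustration and not SGLang's implementation: it uses a plain per-token trie rather than a compressed radix tree, and `PrefixCache`, `Node`, and `kv_slot` are hypothetical names standing in for real KV-cache bookkeeping.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    # Children keyed by token ID; one edge per token (an uncompressed trie).
    children: dict = field(default_factory=dict)
    # Hypothetical stand-in for the KV-cache entry of the token ending here.
    kv_slot: Optional[int] = None


class PrefixCache:
    def __init__(self):
        self.root = Node()
        self.next_slot = 0

    def match_prefix(self, tokens):
        """Return the cached slots for the longest cached prefix of tokens."""
        node, slots = self.root, []
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            slots.append(child.kv_slot)
            node = child
        return slots

    def insert(self, tokens):
        """Cache the sequence; return how many tokens had to be newly stored."""
        node, new = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                child = Node(kv_slot=self.next_slot)
                self.next_slot += 1
                node.children[tok] = child
                new += 1
            node = child
        return new


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]  # token IDs shared by both requests
first = cache.insert(system_prompt + [10, 11])
reused = cache.match_prefix(system_prompt + [20, 21])
second = cache.insert(system_prompt + [20, 21])
print(first, len(reused), second)  # prints "6 4 2"
```

The second request shares its first four tokens with the first, so only its two new tokens need fresh entries. A radix tree would additionally merge chains of single-child nodes into one edge to save memory.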