SGLang
PulseAugur coverage of SGLang — every cluster mentioning SGLang across labs, papers, and developer communities, ranked by signal.
- 2026-01-09 product_launch SGLang released version 0.3.1 of its model gateway, featuring performance and memory improvements. Source
10 days with sentiment data
-
Stateful Transformers boost streaming inference; Intel releases AutoRound quantization toolkit
A new paper introduces a stateful transformer inference engine that significantly speeds up processing for streaming data by maintaining a persistent KV cache. This approach allows for query latency that is independent …
-
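AI models gain tool-calling improvements and bug fixes
A new tool has been built that meets a need raised by Andrej Karpathy; its development reportedly took only 48 hours. Separately, a bug in the open-source SGLang inference engine that affected DeepSeek V4 output has been resolved. In addition, the tool-calling capabilities of NousResearch's Ornstein-Hermes-3.6-27B model have been improved.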
-
New research explores LLM security, efficiency, and training optimization
Researchers are developing novel methods to enhance the efficiency and security of Large Language Models (LLMs). One approach, "Widening the Gap," exploits outlier injection to compromise LLM quantization, demonstrating…
-
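Fireworks AI releases DeepSeek V4 Pro after fixing critical bugs
Fireworks AI has released DeepSeek V4 Pro, an open-source model with notable advances in long-context reasoning, agentic performance, and inference efficiency. The model uses a Mixture-of-Experts architecture and a 1M-token context window, and is designed to handle extensive state and complex agentic workflows cost-effectively. Fireworks AI delayed the public release to resolve critical serving-path correctness issues that caused degraded reasoning and corrupted output, ensuring production readiness before launch.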
-
GLM 5.1 achieves 40 tokens/sec locally on RTX 6000 Pro cards
A user on the r/LocalLLaMA subreddit has successfully optimized the GLM 5.1 model for local deployment, achieving impressive performance metrics. By applying specific patches to the sglang inference software and utilizi…
-
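Moonshot AI releases Kimi K2.6 multimodal agentic model
Moonshot AI has released Kimi K2.6, an open-source multimodal model designed for advanced agentic tasks. The model shows marked improvements in long-horizon coding across multiple languages and domains. Kimi K2.6 also excels at generating production-ready interfaces and full-stack workflows from prompts and visual inputs, with a focus on aesthetic precision.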
-
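Qwen releases 27B multimodal model for advanced coding
Qwen has released Qwen3.6-27B, a dense multimodal model with 27 billion parameters designed for advanced coding tasks. The model aims to deliver flagship-level agentic coding performance, surpassing previous open-source models in its class. Community members have already published various quantized versions of Qwen3.6-27B on Hugging Face, making it easier to use across different platforms and libraries.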
-
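SGLang boosts model gateway performance with cache-aware routing
SGLang has released version 0.3.1 of its model gateway, significantly improving performance and reducing memory usage. The update introduces cache-aware routing that is 10-12x faster and uses 99% less memory, allowing 100x as many cache entries in the same footprint. The release also integrates enterprise-grade security features such as JWT/OIDC authentication and adds support for classification workloads.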
-
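NVIDIA Nemotron Diffusion models deliver 6.4x faster AI inference
NVIDIA has released the Nemotron-Labs Diffusion family of language models, available in 3B, 8B, and 14B parameter sizes. The models uniquely support autoregressive (AR), diffusion, and self-speculative decoding modes within a single architecture, delivering substantial speedups. By generating blocks of tokens in parallel rather than sequentially, Nemotron-Labs Diffusion achieves 6.4x higher throughput than conventional AR models while maintaining or improving accuracy. This addresses the memory-bandwidth bottleneck inherent to AR models, making the models more efficient in production deployments and agentic systems.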
-
MiniMax 2.7: GLM-5 at 1/3 cost SOTA Open Model
MiniMax has released MiniMax 2.7, an open-source model that matches the performance of Z.ai's GLM-5 on several benchmarks but at a significantly lower cost. The model is noted for its efficiency and claims to be the fir…
-
DeepSeek v3 leads open-weight models, Baseten enables mission-critical inference
DeepSeek v3, a new 671B parameter Mixture-of-Experts model, has been released and is currently the top-performing open-weights model available. Serving such large models presents significant challenges, but inference st…