DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing

DuetServe framework optimizes LLM serving through adaptive GPU multiplexing

By the PulseAugur editorial team · [1 source] · 2026-06-02 04:00

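Researchers have developed DuetServe, a new framework for optimizing large language model (LLM) serving. The system addresses the challenge of balancing high throughput against low latency by intelligently managing the separate prefill and decode phases of LLM inference. DuetServe dynamically partitions GPU resources at the streaming multiprocessor (SM) level, providing isolation only when necessary, which prevents interference between the two phases and avoids the inefficiency of replicating the model.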
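The idea of "isolation only when necessary" can be pictured with a toy decision rule. The sketch below is not DuetServe's algorithm: the inverse-scaling cost model, the names `Plan`, `toy_latency_ms`, and `plan_iteration`, and all numbers are hypothetical and chosen only for illustration. It keeps prefill and decode sharing every SM while the estimated decode latency meets a latency target, and otherwise reserves the smallest SM group that brings decode back under the target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Outcome of one scheduling decision (hypothetical structure)."""
    isolated: bool    # True if prefill and decode get separate SM groups
    prefill_sms: int  # SMs available to prefill
    decode_sms: int   # SMs available to decode


def toy_latency_ms(work_units: float, sms: int) -> float:
    """Toy cost model (assumption): latency scales inversely with SM count."""
    return work_units / sms


def plan_iteration(total_sms: int, prefill_work: float,
                   decode_work: float, decode_slo_ms: float) -> Plan:
    """Pick shared or partitioned execution for one iteration (illustrative)."""
    # Shared mode: both phases run together on all SMs, so in this toy
    # model a decode step has to wait for the combined work.
    shared_latency = toy_latency_ms(prefill_work + decode_work, total_sms)
    if shared_latency <= decode_slo_ms:
        return Plan(isolated=False, prefill_sms=total_sms, decode_sms=total_sms)

    # Isolated mode: give decode the smallest SM group that meets the
    # target and leave the remaining SMs to prefill.
    for decode_sms in range(1, total_sms):
        if toy_latency_ms(decode_work, decode_sms) <= decode_slo_ms:
            return Plan(isolated=True,
                        prefill_sms=total_sms - decode_sms,
                        decode_sms=decode_sms)

    # No split meets the target: favor decode, keep one SM for prefill.
    return Plan(isolated=True, prefill_sms=1, decode_sms=total_sms - 1)


if __name__ == "__main__":
    # Light prefill load: sharing all SMs already meets the target.
    print(plan_iteration(100, prefill_work=1000, decode_work=200, decode_slo_ms=20))
    # Heavy prefill load: decode is given its own SM group.
    print(plan_iteration(100, prefill_work=4000, decode_work=200, decode_slo_ms=20))
```

A real system would replace the toy cost model with measured or profiled latency estimates; the point here is only the shape of the decision, which partitions the GPU when sharing would violate the latency target and not otherwise.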

Impact: Improves LLM serving efficiency, potentially lowering latency and raising throughput for deployed models.

Ranking rationale: This cluster contains a research paper detailing a new technical framework for LLM serving.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.LG TIER_1 English(EN) · Lei Gao, Chaoyi Jiang, Hossein Entezari Zarch, Daniel Wong, Mark Hill, Murali Annavaram · 2026-06-02 04:00

DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing

arXiv:2511.04791v2. Abstract: Modern LLM serving systems must sustain high throughput while meeting strict latency SLOs across two distinct inference phases: compute-intensive prefill and memory-bound decode phases. Existing approaches either (1) aggregate b…
