CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

CrossPool engine optimizes serving of sparse MoE LLMs

By the PulseAugur Editorial Team · [2 sources] · 2026-06-23 12:34

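Researchers have introduced CrossPool, a new serving engine designed to efficiently manage multiple sparse Mixture-of-Experts (MoE) large language models (LLMs). The system addresses the GPU memory pressure of hosting many cold models, meaning models that receive requests infrequently yet still occupy substantial memory. CrossPool separates each model's feed-forward network (FFN) weights from its KV cache, creating independent memory pools. FFN weights can then be consolidated across cold models while KV cache is allocated dynamically to active requests, which improves GPU memory utilization and supports longer contexts.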
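The two-pool idea described above can be illustrated with a toy allocator. This is only a minimal sketch under stated assumptions, not CrossPool's actual implementation: the class names `WeightPool` and `KVPool`, the block granularity, and all sizes are hypothetical and chosen purely for illustration. It shows a fixed-budget pool holding the weights of several cold models next to a shared pool of KV-cache blocks that any active request can draw from.

```python
class WeightPool:
    """Hypothetical fixed-budget pool holding FFN weights of several cold models."""

    def __init__(self, capacity_gb: float) -> None:
        self.capacity_gb = capacity_gb
        self.resident: dict[str, float] = {}  # model name -> weight size in GB

    def used_gb(self) -> float:
        return sum(self.resident.values())

    def load(self, model: str, size_gb: float) -> None:
        # Weights are stable: they are placed once and stay resident.
        if self.used_gb() + size_gb > self.capacity_gb:
            raise MemoryError(f"weight pool cannot fit {model}")
        self.resident[model] = size_gb


class KVPool:
    """Hypothetical shared pool of KV-cache blocks, allocated per request."""

    def __init__(self, total_blocks: int) -> None:
        self.free = list(range(total_blocks))
        self.owned: dict[str, list[int]] = {}  # request id -> block ids

    def allocate(self, request_id: str, n_blocks: int) -> list[int]:
        # KV cache is transient: blocks are taken only while a request is active.
        if n_blocks > len(self.free):
            raise MemoryError("KV pool exhausted")
        blocks = [self.free.pop() for _ in range(n_blocks)]
        self.owned.setdefault(request_id, []).extend(blocks)
        return blocks

    def release(self, request_id: str) -> None:
        # Return all blocks of a finished request to the shared pool.
        self.free.extend(self.owned.pop(request_id, []))


# Three cold models share one weight pool (sizes are made up).
weights = WeightPool(capacity_gb=60.0)
for name in ("model-a", "model-b", "model-c"):
    weights.load(name, size_gb=20.0)

# One KV pool serves whichever model currently has traffic.
kv = KVPool(total_blocks=8)
kv.allocate("req-1@model-a", 6)
kv.release("req-1@model-a")
long_ctx = kv.allocate("req-2@model-b", 7)  # reuses blocks freed by model-a
```

In this toy setup, a static split of the 8 blocks across three models would cap each request at 2 or 3 blocks. With the shared pool, a single request can take 7, which is the intuition behind the longer-context claim in the summary.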

Impact: Optimizes GPU memory use when serving multiple LLMs, which could lower the cost of AI services and improve performance.

Ranking rationale: This cluster contains an academic paper detailing a new technique for LLM serving.


AI-generated summary · Google Gemini · based on 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Zhuoren Ye, Tianyu Wo, Dinghao Xue, Mingming Zhang, Yuchen Teng, Chunming Hu, Renyu Yang · 2026-06-24 04:00

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

arXiv:2606.24506v1 Announce Type: cross Abstract: Emerging LLM services increasingly host many sparse MoE models, yet most models receive sparse requests and remain cold. This creates a GPU memory problem: model weights are stable and model-determined, while KV-cache is transient…
arXiv cs.AI TIER_1 English(EN) · Renyu Yang · 2026-06-23 12:34

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

Emerging LLM services increasingly host many sparse MoE models, yet most models receive sparse requests and remain cold. This creates a GPU memory problem: model weights are stable and model-determined, while KV-cache is transient and demand-determined. Because cold models rarely…

