PulseAugur

CrossPool engine optimizes serving for sparse MoE LLMs

Researchers have introduced CrossPool, a serving engine for hosting many sparse Mixture-of-Experts (MoE) large language models (LLMs) at once. It targets the GPU memory problem created by cold models, which receive few requests yet still occupy significant memory. CrossPool separates each model's feed-forward network (FFN) weights from its KV-cache and places them in distinct memory pools. The FFN weights of cold models can then be consolidated, while KV-cache is allocated dynamically to active requests, which improves GPU memory utilization and supports longer contexts.
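The two-pool idea can be illustrated with a toy allocator. This is only a minimal sketch under stated assumptions, not CrossPool's implementation: the class names `WeightPool` and `KVCachePool`, the block-based accounting, and all sizes are hypothetical and chosen for illustration. It shows a static pool sized by the models themselves next to a shared pool of KV-cache blocks that any model's requests can borrow and return.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WeightPool:
    """Static pool (hypothetical): weights are loaded once and stay resident."""
    capacity_gb: float
    resident: dict[str, float] = field(default_factory=dict)

    @property
    def used_gb(self) -> float:
        return sum(self.resident.values())

    def load(self, model: str, size_gb: float) -> None:
        if self.used_gb + size_gb > self.capacity_gb:
            raise MemoryError(f"weight pool full, cannot load {model}")
        self.resident[model] = size_gb


@dataclass
class KVCachePool:
    """Dynamic pool (hypothetical): blocks are shared by all models' requests."""
    total_blocks: int
    free: list[int] = field(init=False)
    owner: dict[int, tuple[str, str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.free = list(range(self.total_blocks))

    def allocate(self, model: str, request: str, n_blocks: int) -> list[int]:
        if n_blocks > len(self.free):
            raise MemoryError(f"not enough KV-cache blocks for {request}")
        blocks = [self.free.pop() for _ in range(n_blocks)]
        for block in blocks:
            self.owner[block] = (model, request)
        return blocks

    def release(self, request: str) -> int:
        """Return all blocks held by a finished request to the shared pool."""
        freed = [b for b, (_, r) in self.owner.items() if r == request]
        for block in freed:
            del self.owner[block]
            self.free.append(block)
        return len(freed)


# Two cold models share one weight pool and one KV-cache pool.
weights = WeightPool(capacity_gb=40.0)
weights.load("moe-a", 12.0)
weights.load("moe-b", 15.0)

kv = KVCachePool(total_blocks=8)
kv.allocate("moe-a", "req-1", 6)   # a long-context request takes most blocks
kv.allocate("moe-b", "req-2", 2)
released = kv.release("req-1")     # blocks return to the pool when req-1 ends
kv.allocate("moe-b", "req-3", 5)   # and are reused by a different model
```

Because the KV-cache pool is not partitioned per model, a single active request can temporarily use most of the blocks, which is one way to read the summary's claim about supporting longer contexts.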

IMPACT Optimizes GPU memory usage for serving multiple LLMs, potentially reducing costs and improving performance for AI services.



AI-generated summary · Google Gemini · from 2 sources.


COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Zhuoren Ye, Tianyu Wo, Dinghao Xue, Mingming Zhang, Yuchen Teng, Chunming Hu, Renyu Yang

    CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

    arXiv:2606.24506v1. Abstract: Emerging LLM services increasingly host many sparse MoE models, yet most models receive sparse requests and remain cold. This creates a GPU memory problem: model weights are stable and model-determined, while KV-cache is transient…

  2. arXiv cs.AI TIER_1 English(EN) · Renyu Yang

    CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

    Emerging LLM services increasingly host many sparse MoE models, yet most models receive sparse requests and remain cold. This creates a GPU memory problem: model weights are stable and model-determined, while KV-cache is transient and demand-determined. Because cold models rarely…