Researchers have developed HELM, a system that optimizes the serving performance of generative recommender models by dynamically managing how High Bandwidth Memory (HBM) is allocated between the embedding (EMB) cache and the KV cache. Existing methods often fail to adapt to shifting workload demands and so miss significant latency improvements. HELM uses a PPO-based (Proximal Policy Optimization) controller for adaptive memory allocation and an EMB-KV-aware scheduler that jointly manages HBM and request routing, achieving substantial reductions in P99 latency.
AI impact: Optimizes serving infrastructure for generative recommenders, potentially reducing latency and improving user experience.
Ranking rationale: This is a research paper detailing a novel system for optimizing recommender model serving.
AI-generated summary · Google Gemini · From 1 source.