PulseAugur

HELM system optimizes GPU HBM for generative recommender latency

Researchers have developed HELM, a system designed to optimize the performance of generative recommender models by dynamically managing High Bandwidth Memory (HBM) allocation between embedding (EMB) caches and KV caches. Existing methods fail to adapt to shifting workload demands, leaving significant latency gains unrealized. HELM uses a PPO-based controller for adaptive memory allocation and an EMB-KV-aware scheduler that jointly manages HBM partitioning and request routing, achieving substantial reductions in P99 latency.
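
To make the mechanism concrete, below is a minimal Python sketch of the kind of control loop such a system could run. Everything here is an illustrative assumption rather than HELM's actual API: the names (HBM_BUDGET_GB, WorkloadStats, propose_split), the numbers, and especially the pressure-based rule, which stands in for the PPO-trained policy the paper describes.

# Hypothetical sketch of an HBM re-partitioning control loop (illustrative,
# not the paper's API): a controller periodically splits a fixed HBM budget
# between the embedding (EMB) hot cache and the KV cache, driven by observed
# workload statistics. HELM trains a PPO policy for this decision; this stub
# replaces it with a fixed heuristic that shifts memory toward whichever
# cache is under more pressure.

from dataclasses import dataclass

HBM_BUDGET_GB = 64.0           # total HBM available to the two caches (assumed)
MIN_SHARE, MAX_SHARE = 0.1, 0.9

@dataclass
class WorkloadStats:
    emb_miss_rate: float       # fraction of EMB lookups missing the hot cache
    kv_evict_rate: float       # fraction of requests forced to evict/recompute KV
    p99_latency_ms: float      # observed tail latency over the last window

def propose_split(stats: WorkloadStats, emb_share: float) -> float:
    """Stand-in for the learned policy: nudge HBM toward the cache whose
    pressure signal is higher. A PPO-trained controller would instead map
    the full workload state to an action distribution."""
    pressure_gap = stats.emb_miss_rate - stats.kv_evict_rate
    step = 0.05 if pressure_gap > 0 else -0.05
    return min(MAX_SHARE, max(MIN_SHARE, emb_share + step))

def control_loop(windows):
    emb_share = 0.5            # start from an even partition
    for stats in windows:
        emb_share = propose_split(stats, emb_share)
        emb_gb = emb_share * HBM_BUDGET_GB
        kv_gb = HBM_BUDGET_GB - emb_gb
        # A real system would trigger an online cache resize here;
        # this sketch just reports the decision.
        print(f"EMB={emb_gb:.1f}GB  KV={kv_gb:.1f}GB  "
              f"p99={stats.p99_latency_ms:.0f}ms")

control_loop([
    WorkloadStats(emb_miss_rate=0.30, kv_evict_rate=0.05, p99_latency_ms=180),
    WorkloadStats(emb_miss_rate=0.12, kv_evict_rate=0.20, p99_latency_ms=150),
])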

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT Optimizes serving infrastructure for generative recommenders, potentially reducing latency and improving user experience.

RANK_REASON This is a research paper detailing a novel system for optimizing recommender model serving.

Read on arXiv cs.LG →

COVERAGE [1]

  1. arXiv cs.LG TIER_1 · Wenjun Yu, Shuguang Han, Amelie Chi Zhou

    One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving

    arXiv:2605.04450v1 Announce Type: cross Abstract: Generative Recommender (GR) inference places embedding hot caches (EMB) and KV caches in direct competition for limited GPU HBM: allocating more memory to one improves its efficiency but degrades the other. Existing systems optimi…
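
The abstract's core tension (one HBM pool, two competing caches) can be illustrated with a toy model. The functional forms below are invented for illustration and are not the paper's cost model: a saturating EMB hit-rate curve, an assumed ~0.5 GB of KV per in-flight request, and a crude latency proxy. Sweeping the split shows why a fixed partition is unlikely to sit at the optimum across workloads.

# Toy model (assumed functional forms, not from the paper) of the
# one-pool/two-caches tradeoff: more HBM for the EMB hot cache raises its
# hit rate but shrinks KV capacity, limiting how many decode requests can
# be batched at once.

HBM_GB = 64.0

def emb_hit_rate(emb_gb: float) -> float:
    # Diminishing returns: hit rate saturates as the hot cache grows (assumed).
    return emb_gb / (emb_gb + 8.0)

def kv_batch_capacity(kv_gb: float) -> float:
    # Assume each in-flight request's KV cache occupies ~0.5 GB.
    return kv_gb / 0.5

def latency_proxy(emb_share: float) -> float:
    emb_gb = emb_share * HBM_GB
    kv_gb = HBM_GB - emb_gb
    miss_penalty = (1.0 - emb_hit_rate(emb_gb)) * 100.0   # ms cost of EMB misses
    queueing = 2000.0 / max(kv_batch_capacity(kv_gb), 1)  # smaller batches -> queueing
    return miss_penalty + queueing

best = min((latency_proxy(s / 100), s / 100) for s in range(10, 91))
print(f"best EMB share ~ {best[1]:.2f} with proxy latency {best[0]:.1f} ms")

Under these assumed curves the optimum sits at an interior split, and it moves as the miss penalty or per-request KV footprint changes, which is the adaptivity gap the paper's controller targets.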