PulseAugur
research · [1 source]

New ML-based GPU caching algorithm LCR boosts LLM inference speed

Researchers have developed a new GPU caching algorithm called Learning-Augmented LRU (LALRU) designed to improve cache efficiency during AI inference. The algorithm integrates learned predictions with classical caching policies so that it stays near-optimal when predictions are accurate and suffers only bounded performance degradation when they are not. A practical implementation named LCR, built on LALRU, demonstrated significant gains: P99 time-to-first-token dropped by up to 28.3% for LLM workloads, and throughput rose by up to 24.2% for DLRM workloads.
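The summary does not spell out how LALRU blends predictions with LRU, so the sketch below only illustrates the general learning-augmented caching idea it alludes to: evictions consult a learned estimate of each key's next access, but eviction candidates are restricted to the least recently used entries, so poor predictions cannot stray arbitrarily far from plain LRU. The class name and the predict_next_access callback are hypothetical, not the paper's API.

```python
from collections import OrderedDict

class LearningAugmentedLRUCache:
    """Illustrative sketch of a learning-augmented LRU cache.

    Evictions consult a (possibly wrong) predictor of each key's next
    access time, but candidates are limited to the `window` least
    recently used keys, so behaviour degrades gracefully toward plain
    LRU when predictions are poor. This is NOT the paper's LALRU,
    only the general idea of blending predictions with an LRU fallback.
    """

    def __init__(self, capacity, predict_next_access, window=4):
        self.capacity = capacity
        self.predict_next_access = predict_next_access  # key -> predicted next-access time
        self.window = window
        self.store = OrderedDict()  # iteration order = least to most recently used

    def get(self, key):
        if key not in self.store:
            return None
        self.store.move_to_end(key)  # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
            self.store[key] = value
            return
        if len(self.store) >= self.capacity:
            # Among the `window` least recently used keys, evict the one the
            # predictor expects to be needed furthest in the future.
            candidates = list(self.store.keys())[: self.window]
            victim = max(candidates, key=self.predict_next_access)
            del self.store[victim]
        self.store[key] = value

# Example with a placeholder predictor (a real system would use a learned model).
cache = LearningAugmentedLRUCache(capacity=128,
                                  predict_next_access=lambda key: hash(key) % 1000)
cache.put("kv-block-42", b"...")
```

Restricting the choice to a small LRU window is one simple way to get the "bounded degradation" property the summary describes; the paper's actual mechanism may differ.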

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT Improves inference efficiency and throughput for LLM and DLRM workloads, potentially lowering operational costs.

RANK_REASON Academic paper introducing a new algorithm for GPU caching in AI inference.

Read on arXiv cs.LG →

COVERAGE [1]

  1. arXiv cs.LG TIER_1 · Peng Chen, Jiaji Zhang, Hailiang Zhao, Yirong Zhang, Shenyao Chen, Jiahong Yu, Xueyan Tang, Yixuan Wang, Hao Li, Jianping Zou, Gang Xiong, Kingsum Chow, Shuibing He, Shuiguang Deng

    Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

    arXiv:2509.20979v2 (Announce Type: replace). Abstract: In modern GPU inference, cache efficiency remains a major bottleneck, and heuristic policies such as LRU can perform far worse than the offline optimum. Existing learning-based caching systems improve hit rates mainly t…
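The abstract's claim that LRU "can perform far worse than the offline optimum" is easy to reproduce on a toy trace. The sketch below (illustrative only, not from the paper) compares LRU against Belady's clairvoyant policy on a cyclic access pattern that defeats LRU.

```python
def lru_misses(trace, capacity):
    """Count misses for LRU on an access trace."""
    cache, misses = [], 0
    for key in trace:
        if key in cache:
            cache.remove(key)
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.pop(0)      # evict the least recently used key
        cache.append(key)         # most recently used key goes to the end
    return misses

def belady_misses(trace, capacity):
    """Count misses for Belady's offline optimum: on a miss, evict the
    cached key whose next use lies furthest in the future."""
    cache, misses = set(), 0
    for i, key in enumerate(trace):
        if key in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(k):
                rest = trace[i + 1:]
                return rest.index(k) if k in rest else float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(key)
    return misses

# A cyclic trace of 3 keys with a 2-slot cache: LRU misses on every
# access (15 misses), while the clairvoyant optimum misses only 9 times.
trace = ["a", "b", "c"] * 5
print(lru_misses(trace, 2), belady_misses(trace, 2))
```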