PulseAugur

Apple researchers propose cache sharing to reduce LLM serving costs

Apple Machine Learning Research has published a paper describing Stochastic KV Routing, a method for reducing the memory footprint of the KV cache in transformer language models. Rather than compressing or evicting cache entries along the temporal axis, the technique targets the depth dimension of the cache. During training, layers randomly attend to the KV states of preceding layers, which makes the model adaptable to various cache-sharing strategies without information loss. This could preserve or even improve performance while significantly cutting memory usage.
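To make the idea of layers randomly attending to preceding layers' KV states more concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names, the `share_prob` parameter, and the uniform choice of donor layer are all assumptions made for illustration. It only samples which layer's KV states each layer would read during one training step, and shows how many distinct caches would then need to be stored.

```python
import random


def sample_kv_routes(num_layers, share_prob, rng):
    """Sample, for each layer, the index of the layer whose KV states it reads.

    Assumption (not from the paper): each layer after the first borrows KV
    states with probability `share_prob`, picking a uniformly random earlier
    layer and reusing whatever cache that layer reads.
    """
    routes = [0]  # the first layer has no predecessor, so it uses its own KV
    for layer in range(1, num_layers):
        if rng.random() < share_prob:
            donor = rng.randrange(layer)   # any preceding layer
            routes.append(routes[donor])   # resolve to the layer that owns the cache
        else:
            routes.append(layer)           # keep this layer's own KV states
    return routes


def cached_layers(routes):
    """Layers whose KV states actually need to be stored under these routes."""
    return sorted(set(routes))


rng = random.Random(0)
routes = sample_kv_routes(num_layers=8, share_prob=0.5, rng=rng)
print("routes:", routes)
print("layers that keep a cache:", cached_layers(routes))
```

Resampling the routes at every training step exposes each layer to many sharing patterns, which is the intuition behind a model that tolerates different cache-sharing strategies at serving time.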

IMPACT Introduces a novel technique for reducing KV cache memory in LLMs, potentially lowering serving costs and enabling longer context windows.
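As a rough illustration of why sharing caches across depth lowers memory use, the sketch below applies the standard KV cache size formula (keys plus values, per cached layer, per KV head, per token). The model dimensions are hypothetical and chosen only for round numbers; they are not taken from the paper, and the pairing of adjacent layers is just one possible sharing strategy.

```python
def kv_cache_bytes(num_cache_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to hold keys and values for every cached layer.

    The leading factor of 2 accounts for storing both K and V.
    bytes_per_value=2 corresponds to 16-bit floats.
    """
    return (2 * num_cache_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 8192-token context.
full = kv_cache_bytes(num_cache_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

# Assumed sharing strategy: each pair of adjacent layers shares one cache,
# so only 16 of the 32 layers store KV states.
shared = kv_cache_bytes(num_cache_layers=16, num_kv_heads=8, head_dim=128, seq_len=8192)

print(f"one cache per layer: {full / 2**30:.2f} GiB")    # 1.00 GiB
print(f"one cache per pair:  {shared / 2**30:.2f} GiB")  # 0.50 GiB
```

Because the cache size is linear in the number of layers that store KV states, any depth-wise sharing scheme reduces memory in proportion to the number of caches it removes, independently of savings along the sequence axis.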

RANK_REASON The cluster contains a research paper published by Apple's ML Research group detailing a novel method for optimizing LLM inference.

Read on Apple Machine Learning Research →

AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

  1. Apple Machine Learning Research TIER_1 English(EN) ·

    Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

    Serving transformer language models with high throughput requires caching Key-Values (KVs) to avoid redundant computation during autoregressive generation. The memory footprint of KV caching is significant and heavily impacts serving costs. This work proposes to lessen these memo…