A new research paper analyzes disaggregated inference architectures, which run the prefill and decode phases on separate GPU pools. The study presents the first formal game-theoretic analysis of this setup, modeling it as a set of coupled games over resource allocation, caching, and request routing. It shows that the Price of Anarchy (PoA), the ratio of system cost under self-interested decisions to the cost of the coordinated optimum, rises significantly as GPUs approach saturation because of latency and cache externalities. Building on this result, the authors design an adaptive controller that tunes routing parameters to reach better operating points, and they report a substantial drop in PoA at a minor throughput cost.
AI IMPACT This research offers insights into optimizing GPU resource allocation for inference, which could lead to more efficient and cost-effective AI deployments.
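For readers unfamiliar with the Price of Anarchy, the following is a minimal sketch of how the ratio is computed in a toy routing game. It is not the paper's model: it uses the classic Pigou two-link example, recast here as two hypothetical GPU pools, one with fixed latency 1 and one whose latency equals the fraction of traffic it receives. All function names are illustrative assumptions.

```python
# Toy Price of Anarchy calculation (Pigou's example), NOT the paper's model.
# Assumption: one unit of request traffic is split between two hypothetical pools:
#   - a "fixed" pool with constant latency 1
#   - a "congestible" pool whose latency equals the fraction x routed to it

def social_cost(x: float) -> float:
    """Total latency when a fraction x of traffic uses the congestible pool."""
    congestible = x * x          # x traffic, each experiencing latency x
    fixed = (1.0 - x) * 1.0      # remaining traffic, each experiencing latency 1
    return congestible + fixed

def equilibrium_fraction() -> float:
    """Selfish routing: the congestible pool's latency x never exceeds 1,
    so no request gains by leaving it and all traffic ends up there."""
    return 1.0

def optimal_fraction(steps: int = 1000) -> float:
    """Coordinated optimum found by a simple grid search over x in [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=social_cost)

def price_of_anarchy() -> float:
    """Ratio of equilibrium cost to optimal cost (1.0 means no inefficiency)."""
    return social_cost(equilibrium_fraction()) / social_cost(optimal_fraction())

if __name__ == "__main__":
    print(f"optimal split: {optimal_fraction():.2f}")
    print(f"PoA: {price_of_anarchy():.4f}")
```

In this toy game the selfish outcome costs 1 while the coordinated split costs 0.75, giving a PoA of 4/3. The paper's setting is richer, since it couples allocation, caching, and routing, but the ratio is read the same way: values further above 1 mean more efficiency is lost to uncoordinated decisions.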
RANK_REASON Academic paper published on arXiv detailing a new analysis and controller for disaggregated inference architectures.
AI-generated summary · Google Gemini · from 1 source.