ML Inference Scheduling with Predictable Latency
A new research paper explores the challenges of scheduling machine learning inference requests to optimize GPU utilization while maintaining predictable latency. The authors identify limitations in existing interference prediction methods: coarse-grained approaches struggle with runtime co-location dynamics, and static models struggle with changing workloads. The paper aims to evaluate these limitations and suggest improvements for more accurate interference prediction in ML inference serving systems.
IMPACT: Addresses core challenges in optimizing ML inference serving for latency-sensitive applications.