Brief · PulseAugur

RESEARCH · arXiv cs.LG · 1w · [2 sources]

DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference

Researchers have developed DriftSched, a framework for making multi-tenant GPU inference for large language models more efficient. It targets runtime token drift, where actual output lengths deviate from initial estimates, and applies adaptive bias correction that reduces estimation error by over 40%. In experiments, combining the Shortest-Job-First (SJF) scheduling policy with DriftSched cut median end-to-end latency by approximately 42%. The framework also includes a runtime feedback-driven drift compensation mechanism and a benchmarking suite for evaluating QoS-aware scheduling on shared GPU infrastructure.
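To make the idea of bias-corrected SJF concrete, here is a minimal Python sketch. It is not DriftSched's implementation: the per-tenant correction ratio, the exponential moving average update, the `alpha` value, and the names `DriftCorrector` and `sjf_order` are all assumptions chosen for illustration. It shows how feedback from completed requests could rescale length estimates and change the SJF order.

```python
import heapq
from collections import defaultdict


class DriftCorrector:
    """Hypothetical per-tenant bias correction for output-length estimates."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha                      # EMA smoothing factor (assumed)
        self.ratio = defaultdict(lambda: 1.0)   # actual/predicted, per tenant

    def correct(self, tenant, predicted_tokens):
        # Scale the raw estimate by the tenant's observed drift ratio.
        return predicted_tokens * self.ratio[tenant]

    def update(self, tenant, predicted_tokens, actual_tokens):
        # Runtime feedback: blend the newly observed ratio into the running one.
        observed = actual_tokens / max(predicted_tokens, 1)
        self.ratio[tenant] = (
            (1 - self.alpha) * self.ratio[tenant] + self.alpha * observed
        )


def sjf_order(pending, corrector):
    """Return request ids sorted by corrected estimate, shortest first."""
    heap = [(corrector.correct(tenant, pred), rid) for rid, tenant, pred in pending]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


corrector = DriftCorrector(alpha=0.5)
pending = [("a1", "tenant_a", 100), ("b1", "tenant_b", 120)]

before = sjf_order(pending, corrector)    # raw estimates: a1 looks shorter
corrector.update("tenant_a", 100, 300)    # tenant_a ran 3x longer than predicted
after = sjf_order(pending, corrector)     # a1 is now estimated at 200 tokens
```

In this sketch a single global ratio would scale every estimate equally and leave the SJF order unchanged, which is why the correction is tracked per tenant; whether DriftSched corrects per tenant, per model, or per request class is not stated in the summary.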

IMPACT: Optimizes GPU resource utilization for LLM inference, potentially lowering costs and improving service responsiveness.

Meta
CPU
Qwen3 0.6B
eBPF
Linux
cuda_sched_trace
sched_ext
stress-ng
GPU
LLM
NVIDIA L4 GPUs
vLLM
FIFO
DriftSched
QoS
Kathiravan Palaniappan