PulseAugur / Brief

last 24h
[1/1] 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference

    Researchers propose DriftSched, a framework for more efficient multi-tenant GPU inference with large language models. It targets runtime token drift, where actual output lengths deviate from initial estimates, and applies adaptive bias correction that reduces estimation error by over 40%. In experiments, Shortest-Job-First (SJF) scheduling combined with DriftSched cut median end-to-end latency by approximately 42%. The framework also includes runtime feedback-driven drift compensation and a benchmarking suite for evaluating QoS-aware scheduling on shared GPU infrastructure.

    IMPACT Optimizes GPU resource utilization for LLM inference, potentially lowering costs and improving service responsiveness.
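
To make the idea concrete, here is a minimal sketch of how bias-corrected length estimates could feed an SJF queue. It is not DriftSched's actual algorithm, which the summary does not describe. The class `DriftCorrector`, the function `sjf_order`, the per-tenant multiplicative bias, and the smoothing factor `alpha` are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


class DriftCorrector:
    """Hypothetical per-tenant bias tracker (an assumption, not the paper's method).

    Keeps an exponential moving average of actual/predicted output length
    and uses it to rescale future predictions for that tenant.
    """

    def __init__(self, alpha=0.5):
        self.alpha = alpha                      # assumed smoothing factor
        self.bias = defaultdict(lambda: 1.0)    # 1.0 means "no correction yet"

    def estimate(self, tenant, predicted_tokens):
        # Corrected estimate = raw prediction scaled by the tenant's bias.
        return predicted_tokens * self.bias[tenant]

    def observe(self, tenant, predicted_tokens, actual_tokens):
        # Runtime feedback: fold the observed drift ratio into the bias.
        if predicted_tokens <= 0:
            return
        ratio = actual_tokens / predicted_tokens
        self.bias[tenant] = (1 - self.alpha) * self.bias[tenant] + self.alpha * ratio


def sjf_order(pending, corrector):
    """Order (request_id, tenant, predicted_tokens) tuples shortest-job-first,
    ranking by the corrected estimate; arrival order breaks ties."""
    heap = [
        (corrector.estimate(tenant, predicted), seq, request_id)
        for seq, (request_id, tenant, predicted) in enumerate(pending)
    ]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


corrector = DriftCorrector(alpha=0.5)
pending = [("a1", "tenant_a", 100), ("b1", "tenant_b", 150)]

order_before = sjf_order(pending, corrector)

# A finished tenant_a request ran three times longer than predicted.
corrector.observe("tenant_a", predicted_tokens=100, actual_tokens=300)

order_after = sjf_order(pending, corrector)
print(order_before, order_after)
```

In this toy run, the raw predictions place `a1` first. Once feedback shows that `tenant_a` requests run long, the corrected estimate for `a1` rises above that of `b1` and the order flips. The bias is kept per tenant because a single global scale factor would leave the SJF ordering unchanged.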