TetriServe: Efficiently Serving Mixed DiT Workloads
Researchers have developed TetriServe, a system for efficiently serving Diffusion Transformer (DiT) models, which are computationally intensive when used for image generation. Traditional serving methods struggle with mixed workloads and strict deadlines, leaving GPUs underutilized and Service Level Objectives (SLOs) missed. TetriServe introduces step-level sequence parallelism and a round-based scheduling mechanism that dynamically adjusts the degree of parallelism for each request according to its deadline, improving both SLO attainment and GPU utilization.
AI IMPACT: This research could lead to more efficient deployment of generative AI models for image creation, improving user experience and reducing operational costs.
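To make the deadline-driven, round-based idea concrete, here is a minimal sketch in Python. It is not TetriServe's actual scheduler: the `Request` fields, the linear scaling model in `step_time`, the `efficiency` value, and the greedy earliest-deadline-first policy are all assumptions chosen for illustration. In each round, every request is given the smallest GPU count projected to meet its deadline, and requests that do not fit wait for the next round.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    steps_left: int        # remaining denoising steps
    base_step_time: float  # seconds per step on one GPU (assumed profile)
    deadline: float        # seconds remaining until the SLO deadline


def step_time(req: Request, gpus: int, efficiency: float = 0.8) -> float:
    """Assumed scaling model: each extra GPU adds `efficiency` of one GPU's speed."""
    return req.base_step_time / (1 + efficiency * (gpus - 1))


def min_gpus_for_deadline(req: Request, options=(1, 2, 4, 8)) -> int:
    """Smallest parallelism degree whose projected finish time meets the deadline."""
    for gpus in options:
        if req.steps_left * step_time(req, gpus) <= req.deadline:
            return gpus
    return max(options)  # best effort when no degree meets the deadline


def schedule_round(requests, total_gpus: int) -> dict:
    """Greedy earliest-deadline-first allocation for a single scheduling round."""
    free = total_gpus
    plan = {}
    for req in sorted(requests, key=lambda r: r.deadline):
        need = min_gpus_for_deadline(req)
        if need <= free:
            plan[req.name] = need
            free -= need
        # Otherwise the request is deferred to the next round.
    return plan


requests = [
    Request("small", steps_left=10, base_step_time=0.1, deadline=2.0),
    Request("large", steps_left=10, base_step_time=1.0, deadline=4.0),
]
print(schedule_round(requests, total_gpus=8))  # {'small': 1, 'large': 4}
```

Under these assumed numbers, the small request meets its deadline on a single GPU, while the large one needs four, so a fixed parallelism degree would either waste GPUs on the small request or miss the large request's deadline. Because allocation is recomputed each round, a request's degree can change between steps as deadlines approach and GPUs free up.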