Brief · PulseAugur

TOOL · arXiv cs.LG English(EN) · 12h

DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing

Researchers have developed DuetServe, a framework for serving large language models (LLMs) that aims to deliver high throughput without sacrificing low latency. It does this by managing the two distinct phases of LLM inference, prefill and decode, which interfere with each other when they share a GPU. DuetServe dynamically partitions GPU resources at the SM (Streaming Multiprocessor) level and isolates the two phases only when necessary, which prevents interference while avoiding the inefficiency of duplicating the model.
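The idea of isolating the phases "only when necessary" can be illustrated with a toy planner. This is a minimal sketch, not DuetServe's actual algorithm: the `plan_sm_split` function, the `Plan` type, the latency inputs, and the assumption that decode latency scales inversely with the number of SMs assigned to it are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    isolated: bool    # True if prefill and decode get disjoint SM sets
    prefill_sms: int  # SMs available to prefill
    decode_sms: int   # SMs available to decode


def plan_sm_split(total_sms: int,
                  shared_decode_ms: float,
                  solo_decode_ms: float,
                  slo_ms: float,
                  step: int = 4) -> Plan:
    """Toy policy (hypothetical): share the GPU unless decode would miss its SLO.

    shared_decode_ms: predicted decode latency when both phases share all SMs.
    solo_decode_ms:   predicted decode latency with the whole GPU to itself.
    Assumes, purely for illustration, that decode latency scales as
    solo_decode_ms * total_sms / decode_sms.
    """
    # No isolation needed: both phases may use every SM.
    if shared_decode_ms <= slo_ms:
        return Plan(False, total_sms, total_sms)

    # Give decode the smallest SM slice that meets the SLO under the toy model;
    # prefill keeps the rest.
    for decode_sms in range(step, total_sms, step):
        estimated_ms = solo_decode_ms * total_sms / decode_sms
        if estimated_ms <= slo_ms:
            return Plan(True, total_sms - decode_sms, decode_sms)

    # Even a nearly full GPU is not enough: hand decode everything.
    return Plan(True, 0, total_sms)


if __name__ == "__main__":
    # Example numbers are made up; 132 is simply a plausible SM count.
    print(plan_sm_split(132, shared_decode_ms=20.0, solo_decode_ms=10.0, slo_ms=25.0))
    print(plan_sm_split(132, shared_decode_ms=60.0, solo_decode_ms=10.0, slo_ms=25.0))
```

In the first call the shared-GPU prediction already meets the latency target, so no partition is created; in the second, decode receives the smallest slice that satisfies the target under the toy scaling model and prefill keeps the remaining SMs.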

IMPACT Improves LLM serving efficiency, potentially lowering latency and increasing throughput for deployed models.

LLM
Lei Gao
DuetServe