FOCUS: DLLMs Know How to Tame Their Compute Bound
Researchers have developed FOCUS, an inference system designed to improve the efficiency of Diffusion Large Language Models (DLLMs). It addresses the high decoding cost of DLLMs by dynamically focusing computation on the most relevant tokens rather than spending resources on non-decodable ones. FOCUS can achieve up to a 3.52x throughput improvement in large-batch scenarios while maintaining or improving generation quality.
AI IMPACT: Optimizes inference for Diffusion LLMs, potentially lowering deployment costs and increasing accessibility.