Researchers have developed a fused INT8 GEMM kernel that speeds up diffusion transformers on consumer Ampere GPUs. The kernel lets these models use the hardware's INT8 tensor cores, removing a software artifact that previously made INT8 slower than the FP8 and NF4 alternatives. It runs GEMM operations 2.8-4.2x faster and yields an overall ~1.1x speedup for image generation at higher resolutions, making 1024px image generation feasible on a single consumer GPU.
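For readers unfamiliar with the term, the following is a minimal pure-Python sketch of what an INT8 GEMM with a fused dequantization step computes in general. It is not the paper's kernel: the summary does not say which steps are fused or how scales are chosen, so the symmetric per-row and per-column scaling and the function names `quantize_rows` and `int8_gemm_dequant` are illustrative assumptions.

```python
def quantize_rows(mat):
    """Symmetric per-row INT8 quantization (assumed scheme, for illustration).

    Returns the quantized integer rows and one float scale per row.
    """
    q_rows, scales = [], []
    for row in mat:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Clamp to the symmetric INT8 range [-127, 127].
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def int8_gemm_dequant(a, b):
    """Approximate a @ b using INT8 operands and integer accumulation."""
    qa, sa = quantize_rows(a)              # one scale per row of a
    qbt, sb = quantize_rows(transpose(b))  # one scale per column of b
    out = []
    for i, a_row in enumerate(qa):
        out_row = []
        for j, b_col in enumerate(qbt):
            # Integer multiply-accumulate; on a GPU this is the part that
            # INT8 tensor cores execute, typically with INT32 accumulators.
            acc = sum(x * y for x, y in zip(a_row, b_col))
            # Dequantize in the same pass instead of a separate step.
            out_row.append(acc * sa[i] * sb[j])
        out.append(out_row)
    return out
```

Calling `int8_gemm_dequant(a, b)` on small float matrices returns a result close to the exact product, with an error bounded by the quantization step. The general motivation for fusing is to avoid writing intermediate integer results to memory and reading them back for a separate dequantization pass.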
IMPACT Enables faster image generation on consumer hardware by optimizing model inference.
RANK_REASON The cluster contains an academic paper detailing a new technical optimization for AI model inference.
AI-generated summary · Google Gemini · from 2 sources.