PulseAugur

New INT8 Kernel Accelerates Diffusion Transformers on Consumer GPUs

Researchers have developed a fused INT8 GEMM kernel that speeds up diffusion transformers on consumer Ampere GPUs. The kernel lets the hardware's INT8 tensor cores be used, removing a software artifact that previously made INT8 slower than the FP8 and NF4 alternatives. It runs GEMM operations 2.8-4.2x faster and yields an overall speedup of about 1.1x for image generation at higher resolutions, making 1024px image generation feasible on a single consumer GPU.
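The gap between a 2.8-4.2x GEMM speedup and a roughly 1.1x end-to-end speedup is what Amdahl's law predicts when the accelerated operations are only part of total runtime. The sketch below is illustrative arithmetic, not a result from the paper. It assumes Amdahl's law applies and that both reported figures describe the same workload, which the summary does not confirm. The function names are hypothetical.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when `fraction` of the runtime
    becomes `local_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def implied_fraction(overall: float, local_speedup: float) -> float:
    """Invert Amdahl's law: the share of the original runtime that must
    have been accelerated to produce the given end-to-end speedup."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / local_speedup)


if __name__ == "__main__":
    # Assumption: 1.1 is the end-to-end figure and 2.8 / 4.2 bound the GEMM figure.
    for gemm_speedup in (2.8, 4.2):
        share = implied_fraction(1.1, gemm_speedup)
        print(f"GEMM {gemm_speedup}x faster -> accelerated share of runtime about {share:.1%}")
```

Under these assumptions the accelerated GEMMs would account for a modest share of the original runtime, which is consistent with a large kernel-level gain producing a small overall one.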

IMPACT Enables faster image generation on consumer hardware by optimizing model inference.

RANK_REASON The cluster contains an academic paper detailing a new technical optimization for AI model inference.

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.LG TIER_1 English(EN) · Ali Asaria, Tony Salomone, Deep Gandhi

    Realizing Native INT8 Compute for Diffusion Transformers on Consumer GPUs: A Fused INT8 GEMM Kernel for Ideogram 4.0

    arXiv:2606.14598v1. Abstract: Post-training INT8 (W8A8) quantization of diffusion transformers is widely deployed as a speed optimization, yet on consumer Ampere GPUs it is frequently slower than the FP8 and NF4 alternatives it is meant to beat. We trace this to…

  2. arXiv cs.LG TIER_1 English(EN) · Deep Gandhi

    Realizing Native INT8 Compute for Diffusion Transformers on Consumer GPUs: A Fused INT8 GEMM Kernel for Ideogram 4.0

    Post-training INT8 (W8A8) quantization of diffusion transformers is widely deployed as a speed optimization, yet on consumer Ampere GPUs it is frequently slower than the FP8 and NF4 alternatives it is meant to beat. We trace this to a software artifact: the production "INT8" forw…
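For readers unfamiliar with the term, W8A8 means both weights and activations are stored as 8-bit integers, the products are accumulated in a wider integer, and the result is rescaled to floating point. The sketch below shows that arithmetic in plain Python. It is not the paper's kernel: the per-tensor symmetric scaling, the function names, and the sample matrices are all assumptions made for illustration, and a real fused kernel would run on GPU tensor cores, not in interpreted loops.

```python
from typing import List, Tuple

Matrix = List[List[float]]
IntMatrix = List[List[int]]


def quantize_per_tensor(m: Matrix) -> Tuple[IntMatrix, float]:
    """Symmetric INT8 quantization with one scale for the whole tensor
    (an assumed scheme; the paper's scheme is not given in the summary)."""
    max_abs = max(abs(v) for row in m for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(v / scale))) for v in row] for row in m]
    return q, scale


def int8_gemm_fused(a_q: IntMatrix, a_scale: float,
                    w_q: IntMatrix, w_scale: float) -> Matrix:
    """Multiply A (M x K) by W (K x N) on integer values, then dequantize
    each output in the same pass, mimicking a fused epilogue."""
    rows, inner, cols = len(a_q), len(w_q), len(w_q[0])
    out: Matrix = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            acc = 0  # stands in for the INT32 accumulator
            for k in range(inner):
                acc += a_q[i][k] * w_q[k][j]
            out_row.append(acc * a_scale * w_scale)  # dequantize once per output
        out.append(out_row)
    return out


def float_gemm(a: Matrix, w: Matrix) -> Matrix:
    """Reference floating-point product for comparison."""
    return [[sum(a[i][k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(a))]


if __name__ == "__main__":
    activations = [[0.5, -1.0, 0.25], [1.5, 0.75, -0.5]]  # made-up values
    weights = [[0.2, -0.4], [0.1, 0.3], [-0.6, 0.5]]      # made-up values
    a_q, a_s = quantize_per_tensor(activations)
    w_q, w_s = quantize_per_tensor(weights)
    approx = int8_gemm_fused(a_q, a_s, w_q, w_s)
    exact = float_gemm(activations, weights)
    err = max(abs(x - y) for r1, r2 in zip(approx, exact) for x, y in zip(r1, r2))
    print("max abs error vs float:", err)
```

Applying the scales once per output element, after the integer accumulation, is what keeps the inner loop purely integer. That inner loop is the part INT8 tensor cores are built to accelerate.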