PulseAugur
Realizing Native INT8 Compute for Diffusion Transformers on Consumer GPUs: A Fused INT8 GEMM Kernel for Ideogram 4.0

New INT8 kernel accelerates Diffusion Transformers on consumer GPUs

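Researchers have developed a fused INT8 GEMM kernel that significantly accelerates Diffusion Transformers on consumer Ampere GPUs. The new kernel makes use of the hardware's INT8 tensor cores, overcoming a software limitation that previously made INT8 slower than the FP8 and NF4 alternatives. The optimized kernel achieves 2.8-4.2x faster GEMM operations and delivers an overall image-generation speedup of about 1.1x at higher resolutions, making it possible to generate 1024px images on a single consumer GPU.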
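The gap between the 2.8-4.2x GEMM speedup and the roughly 1.1x end-to-end speedup is what Amdahl's law predicts when the accelerated GEMMs are only part of the total runtime. The sketch below is a back-of-the-envelope check, not a figure from the paper: it assumes that only the GEMM time changes and that both speedups are measured against the same baseline, and the function names are hypothetical.

```python
def overall_speedup(gemm_fraction, gemm_speedup):
    """Amdahl's law: only the GEMM share of the baseline runtime gets faster."""
    return 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / gemm_speedup)


def implied_gemm_fraction(overall, gemm_speedup):
    """Invert Amdahl's law: the runtime share the GEMMs would need to occupy
    for a given GEMM speedup to produce a given overall speedup."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / gemm_speedup)


if __name__ == "__main__":
    # Reported figures: 2.8-4.2x on GEMMs, about 1.1x end to end.
    for s in (2.8, 4.2):
        f = implied_gemm_fraction(1.1, s)
        print(f"GEMM speedup {s}x -> implied GEMM share of runtime: {f:.1%}")
```

Under these assumptions the accelerated GEMMs would account for roughly 12% to 14% of the baseline runtime. The rest of the pipeline is untouched by the kernel, which is why the end-to-end gain stays modest.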

Impact: Faster image generation on consumer hardware through optimized model inference.

Ranking rationale: This cluster contains an academic paper detailing a new technical optimization for AI model inference.

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

  1. arXiv cs.LG TIER_1 English(EN) · Ali Asaria, Tony Salomone, Deep Gandhi ·

    Realizing Native INT8 Compute for Diffusion Transformers on Consumer GPUs: A Fused INT8 GEMM Kernel for Ideogram 4.0

    arXiv:2606.14598v1 Announce Type: new Abstract: Post-training INT8 (W8A8) quantization of diffusion transformers is widely deployed as a speed optimization, yet on consumer Ampere GPUs it is frequently slower than the FP8 and NF4 alternatives it is meant to beat. We trace this to…

  2. arXiv cs.LG TIER_1 English(EN) · Deep Gandhi ·

    Realizing Native INT8 Compute for Diffusion Transformers on Consumer GPUs: A Fused INT8 GEMM Kernel for Ideogram 4.0

    Post-training INT8 (W8A8) quantization of diffusion transformers is widely deployed as a speed optimization, yet on consumer Ampere GPUs it is frequently slower than the FP8 and NF4 alternatives it is meant to beat. We trace this to a software artifact: the production "INT8" forw…
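For readers unfamiliar with the term, W8A8 means that both weights and activations are stored as 8-bit integers, multiplied with integer accumulation, and rescaled to floating point afterwards. The following is a minimal pure-Python sketch of that arithmetic, not the authors' GPU kernel. The symmetric quantization scheme, the per-row activation and per-output-channel weight scales, and the function names are all assumptions chosen for illustration. Applying the rescale in the same loop as the accumulation is one plausible reading of "fused"; the truncated abstract does not say what is fused.

```python
def quantize_symmetric(values):
    """Map floats to int8 codes in [-127, 127] using one shared scale (assumed scheme)."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale for all-zero rows
    scale = max_abs / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def w8a8_matmul(activations, weights):
    """Compute activations @ weights^T with int8 operands.

    activations: M x K list of floats, quantized per row.
    weights:     N x K list of floats, quantized per output channel.
    """
    act_q = [quantize_symmetric(row) for row in activations]
    w_q = [quantize_symmetric(row) for row in weights]
    out = []
    for a_codes, a_scale in act_q:
        out_row = []
        for w_codes, w_scale in w_q:
            # Integer multiply-accumulate; a Python int stands in for an INT32 accumulator.
            acc = sum(a * w for a, w in zip(a_codes, w_codes))
            # Rescale to float right after accumulation, with no separate pass.
            out_row.append(acc * a_scale * w_scale)
        out.append(out_row)
    return out


if __name__ == "__main__":
    acts = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
    wts = [[1.0, 0.5, -0.5], [0.2, -0.4, 0.8]]
    print(w8a8_matmul(acts, wts))
```

The result approximates the floating-point product up to quantization error. On a GPU, the inner integer multiply-accumulate is the part that INT8 tensor cores execute.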