PulseAugur
research · [2 sources] ·

Together AI releases FlashAttention-3 and -4 for faster LLM processing

Together AI has released FlashAttention-3 and FlashAttention-4, major upgrades to the FlashAttention GPU attention kernel for large language models. FlashAttention-3, designed for Hopper GPUs, reaches up to 75% utilization of the H100's theoretical peak throughput and a 1.5-2x speedup over FlashAttention-2 by exploiting the asynchrony of the Tensor Cores and the Tensor Memory Accelerator (TMA), and it adds support for FP8 precision. FlashAttention-4, optimized for Blackwell GPUs, improves performance further by pipelining computation for maximum overlap and by addressing bottlenecks in transcendental functions (the softmax exponentials) and shared memory traffic, reaching 71% utilization and offering substantial speedups over existing libraries.

Summary written by gemini-2.5-flash-lite from 2 sources.
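
The utilization figures quoted in the summary are ratios of achieved to peak arithmetic throughput. A minimal sketch of that arithmetic follows, assuming the common convention of counting only the two attention matmuls (4 * seqlen^2 * head_dim FLOPs per head in the forward pass). The function names are hypothetical, and the peak value is an assumption taken from NVIDIA's published dense BF16 figure for the H100 SXM, not from the sources summarized here.

```python
def attention_forward_flops(batch, heads, seqlen, head_dim, causal=False):
    """Approximate FLOPs for one attention forward pass.

    Counts the two matmuls (Q @ K^T and P @ V), each costing
    2 * seqlen^2 * head_dim FLOPs per head. Causal masking skips
    roughly half of the score matrix, so the count is halved.
    """
    flops = 4 * batch * heads * seqlen ** 2 * head_dim
    return flops // 2 if causal else flops


def achieved_tflops(flops, seconds):
    """Convert a FLOP count and a measured runtime into TFLOPs/s."""
    return flops / seconds / 1e12


def utilization(achieved, peak):
    """Fraction of the peak throughput that was actually achieved."""
    return achieved / peak


# ASSUMPTION: dense BF16 peak of an H100 SXM, from NVIDIA's spec sheet.
H100_PEAK_TFLOPS = 989.0

# Throughput implied by the quoted "up to 75% utilization" under that assumption.
implied_tflops = 0.75 * H100_PEAK_TFLOPS
print(f"75% of assumed peak: {implied_tflops:.0f} TFLOPs/s")
```

Given a measured kernel runtime, `utilization(achieved_tflops(flops, seconds), H100_PEAK_TFLOPS)` yields the same kind of percentage the summary reports.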

IMPACT These optimized attention mechanisms promise significantly faster LLM training and inference, enabling longer context windows and more efficient GPU utilization.

RANK_REASON The cluster describes new algorithmic techniques and software releases (FlashAttention-3 and -4) for optimizing attention mechanisms on specific GPU architectures, detailing performance improvements and hardware feature utilization.

COVERAGE [2]

  1. Together AI blog TIER_1 ·

    FlashAttention

    FlashAttention-3 achieves up to 75% GPU utilization on H100s, making AI models up to 2x faster and enabling efficient processing of longer text inputs. It allows for faster training and inference of LLMs and supports lower-precision operations (such as FP8) for improved efficiency.

  2. Together AI blog TIER_1 ·

    FlashAttention

    As GPU throughput outpaces memory bandwidth, kernels must evolve. We introduce FlashAttention-4, featuring new pipelining for maximum overlap, 2-CTA MMA modes to reduce shared memory traffic, and a hardware-software hybrid approach to softmax exponentials.
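
One plausible reading of the "hardware-software hybrid approach to softmax exponentials" is that some exponentials are evaluated by the GPU's special-function hardware while others are computed in software with ordinary multiply-add instructions. The sketch below illustrates only the software half in plain Python: range reduction followed by a cubic polynomial evaluated with Horner's rule. The function names are hypothetical, and the Taylor coefficients are an illustrative stand-in, not the coefficients or the work split used by the actual kernel.

```python
import math

LN2 = math.log(2.0)

# ASSUMPTION: Taylor coefficients of 2**f = exp(f * ln 2) up to the cubic
# term. A production kernel would likely use tuned (minimax) coefficients.
C1 = LN2
C2 = LN2 ** 2 / 2.0
C3 = LN2 ** 3 / 6.0


def exp2_poly(x):
    """Approximate 2**x using only multiply-adds plus an exponent shift."""
    n = math.floor(x + 0.5)                  # nearest integer to x
    f = x - n                                # reduced argument in [-0.5, 0.5)
    p = 1.0 + f * (C1 + f * (C2 + f * C3))   # Horner evaluation of the cubic
    return math.ldexp(p, n)                  # p * 2**n via exponent adjustment


def softmax_exp2(scores):
    """Softmax computed in base 2, as attention kernels commonly do."""
    m = max(scores)                          # subtract the max for stability
    log2e = 1.0 / LN2                        # exp(z) == 2**(z * log2(e))
    weights = [exp2_poly((s - m) * log2e) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


probs = softmax_exp2([1.0, 2.0, 3.0])
```

With the reduced argument confined to [-0.5, 0.5), the cubic stays within about 1e-3 relative error of the exact value, which is comparable to the rounding error of the low-precision formats attention kernels typically work in.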