FlashAttention
Together AI has released FlashAttention-3 and FlashAttention-4, significant upgrades to its GPU attention kernels for large language models. FlashAttention-3, designed for Hopper GPUs, reaches up to 75% GPU utilization and a 1.5-2x speedup over FlashAttention-2 by exploiting Hopper hardware features such as asynchronous Tensor Cores and the Tensor Memory Accelerator (TMA), and by adding FP8 support. FlashAttention-4, optimized for Blackwell GPUs, pipelines computation and targets bottlenecks in transcendental functions (the exponentials in softmax) and in memory traffic, reaching 71% utilization and substantial speedups over existing libraries.
IMPACT These optimized attention mechanisms promise significantly faster LLM training and inference, enabling longer context windows and more efficient GPU utilization.
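
As a rough illustration of the memory-traffic and exponential bottlenecks mentioned above, the pure-Python sketch below shows the blockwise "online softmax" idea that the FlashAttention family is generally built on: keys and values are processed in small blocks, and only a running maximum, a running denominator, and an output accumulator are kept per query, so the full score matrix is never stored. This is not the FlashAttention-3 or FlashAttention-4 kernel code. The names `naive_attention`, `tiled_attention`, and `block` are illustrative assumptions, and none of the Hopper or Blackwell features (TMA, FP8, pipelining) are modeled.

```python
import math
import random


def naive_attention(Q, K, V):
    """Reference attention: builds the full score row for every query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    """Blockwise attention with an online softmax (illustrative sketch).

    Only a running max, a running denominator, and an accumulator are
    kept per query; scores exist for one K/V block at a time.
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = float("-inf")   # running max of all scores seen so far
        denom = 0.0         # running softmax denominator
        acc = [0.0] * dv    # running unnormalized output
        for start in range(0, len(K), block):
            Kb = K[start:start + block]
            Vb = V[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in Kb]
            m_new = max(m, max(scores))
            # Rescale earlier partial results to the new max.
            # On the first block this is exp(-inf) == 0.0.
            correction = math.exp(m - m_new)
            # These exp calls are the transcendental-function work.
            weights = [math.exp(s - m_new) for s in scores]
            denom = denom * correction + sum(weights)
            acc = [a * correction + sum(w * v[j] for w, v in zip(weights, Vb))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / denom for a in acc])
    return out


if __name__ == "__main__":
    random.seed(0)
    Q = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
    K = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
    V = [[random.gauss(0, 1) for _ in range(3)] for _ in range(5)]
    ref = naive_attention(Q, K, V)
    got = tiled_attention(Q, K, V, block=2)
    diff = max(abs(x - y) for r, g in zip(ref, got) for x, y in zip(r, g))
    print("max abs difference:", diff)
```

Both functions compute the same result up to floating-point rounding. The tiled version trades a few extra multiplications (the `correction` rescaling) for never materializing the full score matrix, which is the kind of memory-traffic saving the summary refers to.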