Brief

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.
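As a rough illustration of how such a score could be assembled, the sketch below combines the three content signals as a weighted sum and then applies an exponential time decay. The weights, the 48-hour half-life, and the names `score_story` and `clamp01` are all assumptions made for illustration; the source does not state how the four signals are actually combined.

```python
# Hypothetical weights (they sum to 100); the real weighting is not given in the source.
WEIGHTS = {"authority": 40, "cluster_strength": 35, "headline_signal": 25}
HALF_LIFE_HOURS = 48.0  # assumed: a story loses half its score every 48 hours


def clamp01(x):
    """Keep a signal inside the 0..1 range."""
    return max(0.0, min(1.0, x))


def score_story(authority, cluster_strength, headline_signal, age_hours):
    """Return a 0-100 score from three signals in [0, 1] and the story age in hours."""
    base = (
        WEIGHTS["authority"] * clamp01(authority)
        + WEIGHTS["cluster_strength"] * clamp01(cluster_strength)
        + WEIGHTS["headline_signal"] * clamp01(headline_signal)
    )
    # Exponential time decay: the multiplier is 1.0 for a fresh story, 0.5 after one half-life.
    decay = 0.5 ** (max(0.0, age_hours) / HALF_LIFE_HOURS)
    return base * decay
```

Under these assumed settings, a story with perfect signals starts at 100 and falls to 50 after 48 hours; a multiplicative decay keeps older stories ranked relative to each other while letting fresh ones rise.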

RESEARCH · Together AI blog English(EN) · 3d · [2 sources]

FlashAttention

Together AI has released FlashAttention-3 and FlashAttention-4, significant upgrades to its GPU attention kernels for large language models. FlashAttention-3, designed for Hopper GPUs, reaches up to 75% hardware utilization and a 1.5-2x speedup over its predecessor, FlashAttention-2, by exploiting Hopper features such as asynchronous Tensor Core instructions and the Tensor Memory Accelerator (TMA), and by supporting FP8 precision. FlashAttention-4, optimized for Blackwell GPUs, pipelines computation further and addresses bottlenecks in transcendental functions (the exponentials in softmax) and in memory traffic, reaching 71% utilization and offering substantial speedups over existing libraries.

IMPACT These optimized attention mechanisms promise significantly faster LLM training and inference, enabling longer context windows and more efficient GPU utilization.
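For readers unfamiliar with the underlying idea, here is a minimal pure-Python sketch of the tiling and online-softmax trick that the FlashAttention family builds on. It is not the FlashAttention-3 or FlashAttention-4 kernel, which are GPU code with hardware-specific pipelining; the function names and the `block_size` parameter are hypothetical and chosen only for illustration. The sketch processes keys and values one block at a time for a single query vector, keeping a running maximum and running sum so the full score row never has to be stored.

```python
import math


def naive_attention_row(q, keys, values):
    """Reference: softmax(q . K^T / sqrt(d)) . V for one query vector."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dv = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dv)]


def tiled_attention_row(q, keys, values, block_size=2):
    """Same result, but visiting keys/values block by block (online softmax)."""
    d = len(q)
    dv = len(values[0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator, relative to running_max
    acc = [0.0] * dv             # unnormalized output, relative to running_max
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (factor is 0.0 on the first block).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)  # the exponential is the transcendental step
            running_sum += w
            for j in range(dv):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point rounding for any block size. The tiled version only ever holds one block of scores at a time, which is the property that lets a GPU kernel keep the working set in fast on-chip memory.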

RESEARCH · Lobsters — AI tag English(EN) · 4d · [3 sources]

Dissecting ThunderKittens, anatomy of a compact DSL for high-performance AI kernels

A new article details ThunderKittens, a compact domain-specific language (DSL) developed at Stanford's Hazy Research Lab for writing high-performance AI kernels. The DSL aims to balance research productivity against hardware efficiency by abstracting repetitive GPU programming tasks such as tile layouts and memory allocation. Developers can still reason closely about data movement and scheduling while tuning performance for modern AI workloads on NVIDIA's Hopper and Blackwell architectures.

IMPACT Makes high-performance GPU kernels easier to write, which can translate into more efficient AI model training and inference.

- NVIDIA
- AI
- Stanford
- FlashAttention-2
- Hopper
- PyTorch
- CUDA
- GPU
- Blackwell
- Triton
- Hazy Research Lab
- ThunderKittens
