Brief

last 24h

[2/2] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

RESEARCH · Together AI blog English(EN) · 3d · [2 sources]

FlashAttention

Together AI has released FlashAttention-3 and FlashAttention-4, significant upgrades to their GPU-accelerated attention mechanism for large language models. FlashAttention-3, designed for Hopper GPUs, achieves up to 75% utilization and 1.5-2x speedup over its predecessor by exploiting new hardware features like Tensor Cores and Tensor Memory Accelerator, and supporting FP8 precision. FlashAttention-4, optimized for Blackwell GPUs, further enhances performance by pipelining computations and addressing bottlenecks in transcendental functions and memory traffic, reaching 71% utilization and offering substantial speedups over existing libraries. AI

IMPACT These optimized attention mechanisms promise significantly faster LLM training and inference, enabling longer context windows and more efficient GPU utilization.
RESEARCH · MarkTechPost English(EN) · 1w · [3 sources]

NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon

NVIDIA has developed a new 4-bit pretraining methodology called NVFP4, designed to overcome the challenges of reduced dynamic range and increased quantization error in narrower floating-point formats. This method was successfully validated by pretraining a 12-billion-parameter hybrid Mamba-Transformer model on 10 trillion tokens, marking the longest publicly documented training run in 4-bit precision to date. The resulting model achieved performance nearly identical to an FP8 baseline on the MMLU-Pro benchmark, demonstrating the viability of NVFP4 for large-scale model training. AI

IMPACT Enables more efficient training of large language models by reducing precision requirements without significant performance loss.

Brief

FlashAttention

NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon