Brief

last 24h

[4/4] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.
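
One way such a 0-100 score could be assembled is sketched below. The feed does not publish its formula, so the weights, the half-life, and the function name `story_score` are all hypothetical and only illustrate how three signals and a time decay might combine.

```python
# Hypothetical weights: the feed does not publish its actual formula.
WEIGHTS = {"authority": 0.4, "cluster_strength": 0.35, "headline_signal": 0.25}
HALF_LIFE_HOURS = 48.0  # assumed half-life for the time decay


def story_score(authority, cluster_strength, headline_signal, age_hours):
    """Combine three signals in [0, 1], decay the result with age, return 0-100."""
    base = (
        WEIGHTS["authority"] * authority
        + WEIGHTS["cluster_strength"] * cluster_strength
        + WEIGHTS["headline_signal"] * headline_signal
    )
    # Exponential decay: the score halves every HALF_LIFE_HOURS.
    decay = 0.5 ** (age_hours / HALF_LIFE_HOURS)
    return round(100.0 * base * decay, 1)
```

Under these assumed settings, a story with perfect signals scores 100 when fresh and half that after one half-life.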

TOOL · arXiv cs.LG English(EN) · 21h · [2 sources]

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

Researchers have introduced ThriftAttention, a method for making long-context attention cheaper to compute. It runs most of the attention computation at low precision (FP4) and reserves higher precision (FP16) for the most critical query-key interactions. Keeping only about 5% of the blocks in FP16 recovers nearly 90% of the gap to full FP16, sharply reducing the quality degradation that low-bit precision usually causes at long context lengths.
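
The selection step described above can be sketched as a top-fraction pick over per-block importance scores. This is a minimal illustration, not the paper's method: the summary does not say how importance is measured, so `block_scores` and both function names are hypothetical.

```python
def select_high_precision_blocks(block_scores, fraction=0.05):
    """Return the indices of the blocks to keep in high precision.

    block_scores: one importance score per query-key block (hypothetical metric).
    fraction: share of blocks kept in FP16; about 5% per the summary.
    """
    n_keep = max(1, round(len(block_scores) * fraction))
    ranked = sorted(
        range(len(block_scores)), key=lambda i: block_scores[i], reverse=True
    )
    return set(ranked[:n_keep])


def precision_plan(block_scores, fraction=0.05):
    """Label every block 'fp16' or 'fp4' according to the selection."""
    keep = select_high_precision_blocks(block_scores, fraction)
    return ["fp16" if i in keep else "fp4" for i in range(len(block_scores))]
```

With 40 blocks and the default fraction, two blocks are marked `fp16` and the remaining 38 are marked `fp4`.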

IMPACT Makes long-context models more efficient, potentially lowering inference costs and broadening the use of models with long context windows.

RESEARCH · Together AI blog English(EN) · 3d · [2 sources]

FlashAttention

Together AI has released FlashAttention-3 and FlashAttention-4, major upgrades to the FlashAttention GPU attention kernels for large language models. FlashAttention-3, designed for Hopper GPUs, reaches up to 75% utilization and a 1.5-2x speedup over FlashAttention-2 by exploiting Hopper hardware features such as asynchronous Tensor Cores and the Tensor Memory Accelerator (TMA), and it adds support for FP8 precision. FlashAttention-4, optimized for Blackwell GPUs, pipelines computation and targets bottlenecks in transcendental functions and memory traffic, reaching 71% utilization and substantial speedups over existing libraries.
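
The FlashAttention family computes exact attention tile by tile while maintaining a running softmax, so the full score matrix is never materialized. The sketch below shows only that core idea for a single query in plain Python. It is not the Hopper or Blackwell kernel, it omits the usual 1/sqrt(d) scaling, and the function names are illustrative.

```python
import math


def attention_row_tiled(q, keys, values, tile=2):
    """Exact softmax attention for one query, consuming keys/values in tiles."""
    m = float("-inf")              # running maximum of the scores seen so far
    denom = 0.0                    # running softmax denominator
    acc = [0.0] * len(values[0])   # running weighted sum of value vectors
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        # Rescale earlier partial results to the new maximum
        # (exp(-inf) is 0.0 on the first tile, when nothing has accumulated).
        rescale = math.exp(m - m_new)
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - m_new)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        m = m_new
    return [a / denom for a in acc]


def attention_row_reference(q, keys, values):
    """Straightforward softmax attention over all keys at once, for comparison."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]
```

Both functions should agree up to floating-point rounding for any tile size, which is the property that lets tiled kernels stay exact rather than approximate.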

IMPACT These optimized attention mechanisms promise significantly faster LLM training and inference, enabling longer context windows and more efficient GPU utilization.

TOOL · arXiv cs.CV English(EN) · 1w

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

Researchers have developed LongLive-2.0, a parallel infrastructure for training and serving long video generation models. It uses NVFP4 precision and sequence-parallel autoregressive training to cut memory requirements and speed up computation. At inference time it combines W4A4 NVFP4 inference with asynchronous streaming VAE decoding for high throughput, with reported speedups of up to 2.15x in training and 1.84x in inference.
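
To make the NVFP4 format concrete, here is a rough sketch of block-scaled 4-bit quantization as NVIDIA describes it: E2M1 values (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) with one scale per block of 16 elements. It is a simplification and not LongLive-2.0's code. The real format stores block scales in FP8 plus a per-tensor FP32 scale, while this version keeps the scale as an ordinary Python float, and the function names are illustrative.

```python
import math

E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable |x| in FP4 E2M1
BLOCK = 16  # elements sharing one scale


def quantize_block(block):
    """Fake-quantize one block: scale so the largest magnitude maps to 6.0, then snap."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / 6.0  # simplification: kept as a float, not rounded to FP8
    out = []
    for x in block:
        nearest = min(E2M1_MAGNITUDES, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(nearest * scale, x))
    return out


def fake_quantize_fp4(values, block=BLOCK):
    """Quantize and immediately dequantize a flat list, block by block."""
    out = []
    for start in range(0, len(values), block):
        out.extend(quantize_block(values[start:start + block]))
    return out
```

Because every block gets its own scale, a single large outlier only coarsens the 16 values that share its block, not the whole tensor.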

IMPACT Enables more efficient training and faster inference for long video generation models, potentially leading to wider adoption and new applications.

RESEARCH · Mastodon — mastodon.social English(EN) · 3w · [4 sources]

📰 LongCat Image Edit 2026: 30% Faster Facial Inpainting in Stable Diffusion with Zero Artifacts The LongCat Image Edit model has emerged as a niche but highly e

LongCat Image Edit 2026, a 6-billion-parameter model developed by Meituan, is reported to deliver superior facial inpainting in Stable Diffusion workflows, running 30% faster with zero artifacts. The model is described as combining natural realism with efficiency and as outperforming Stable Diffusion in image editing tasks. Separately, SageAttention kernels are speeding up AI inference by up to 35% on Blackwell GPUs by optimizing attention operations for image and video models.

IMPACT New models and optimizations promise faster, more efficient AI image generation and editing.
