ENTITY FlashAttention

FlashAttention

PulseAugur coverage of FlashAttention: every cluster mentioning FlashAttention across labs, papers, and developer communities, ranked by signal.

Total · 30d

26 over 90d

Releases · 30d

0 over 90d

Papers · 30d

19 over 90d

TIER MIX · 90D

significant 1
research 14
tool 10
commentary 1

SENTIMENT · 30D

6 day(s) with sentiment data

RECENT · PAGE 1/2 · 26 TOTAL

RESEARCH · CL_108502 · Jun 24 · 10:18

New EpiKV method optimizes LLM KV cache, boosting efficiency and context length

A new research paper introduces EpiKV, a method for optimizing KV cache eviction in large language models. Unlike previous methods that rely on attention weights, EpiKV uses an "epiphany score" derived from changes in t…
RESEARCH · CL_112712 · Jun 23 · 11:38

New book details modern GPU programming for AI workloads

A new book titled "Modern GPU Programming for MLSys" aims to demystify high-performance GPU kernel development for machine learning systems. The book, originating from Carnegie Mellon University's Machine Learning Syste…
TOOL · CL_101987 · Jun 20 · 19:05

Free 15-part series explains LLM internals with Gemma 4 12B

A 15-part series delves into the inner workings of Large Language Models, using Gemma 4 12B as a practical example. The series covers topics from tokenization and tensor shapes to inference, memory constraints, and fine…
SIGNIFICANT · CL_101878 · Jun 20 · 16:56

Subquadratic unveils SubQ LLM for single-pass codebase processing

Subquadratic Inc. has unveiled SubQ, a new long-context language model that claims to process entire codebases or document sets in a single pass. The model utilizes a subquadratic, sparse-attention design, which theoret…
TOOL · CL_91640 · Jun 15 · 09:16

Flash-KMeans accelerates GPU k-means clustering over 200x

Researchers from UC Berkeley and UT Austin have developed Flash-KMeans, an open-source library that significantly accelerates the k-means clustering algorithm for modern AI pipelines. By optimizing data movement on GPUs…
TOOL · CL_79834 · Jun 9 · 04:00

Math framework slashes transformer memory use, boosts speed

Researchers have developed a new framework called Mathematics of Arrays (MoA) to optimize transformer kernels, which are computationally intensive components of modern AI models. This framework uses algebraic constructi…
TOOL · CL_62826 · Jun 1 · 04:00

New bias method enables faster, scalable Super-Resolution Transformers

Researchers have developed a new method called Rank-factorized Implicit Neural Bias (RIB) to improve the efficiency of Super-Resolution Transformers. This technique allows these models to utilize hardware-accelerated ke…
RESEARCH · CL_55666 · May 27 · 00:00

OSP-Next video model achieves 83.73% VBench score with efficiency gains

Researchers have introduced OSP-Next, a novel text-to-video generation model designed for enhanced efficiency and quality. The model integrates sparse attention mechanisms, a novel Sparse Sequence Parallelism (SSP) tech…
RESEARCH · CL_48931 · May 22 · 15:23

New technique slashes I/O costs for LLM attention mechanisms

Researchers have developed a new technique to significantly reduce the I/O complexity of attention mechanisms in large language models. This method aims to minimize data transfers between fast and slow memory, a critica…
RESEARCH · CL_35013 · May 16 · 22:23

Nous Research's Lighthouse Attention speeds up LLM pretraining

Researchers at Nous Research have developed Lighthouse Attention, a novel hierarchical attention mechanism designed to accelerate the pretraining of large language models with long contexts. This method achieves a 1.4x …
RESEARCH · CL_44749 · May 16 · 00:00

New research tackles attention mechanism limitations in transformers

Researchers are exploring novel approaches to enhance the efficiency and effectiveness of attention mechanisms in transformers. Several papers introduce methods to mitigate issues like over-smoothing and computational b…
RESEARCH · CL_36554 · May 15 · 06:56

New research tackles diffusion language model limitations

Researchers are exploring new methods to improve diffusion language models (DLMs), which offer faster inference than autoregressive models. Several recent papers introduce techniques to enhance DLM performance, includin…
TOOL · CL_31732 · May 14 · 13:24

Guide details building FlashAttention wheel file for ML integration

This article provides a guide on how to build and install version 2.8.3 of FlashAttention. It focuses on the technical process of creating a wheel file, which is a standard distribution format for Python packages. The g…
RESEARCH · CL_34499 · May 11 · 20:03

New attention methods tackle LLM long-context challenges

Researchers are developing new attention mechanisms to handle increasingly long contexts in large language models. One approach, Runtime-Certified Bounded-Error Quantized Attention, uses tiered KV caches to compress mem…
RESEARCH · CL_14450 · May 4 · 01:57

Researchers explore novel attention mechanisms and optimization techniques for LLMs

Researchers are exploring novel attention mechanisms to overcome the quadratic complexity of standard self-attention in transformers, particularly for long-context processing. Several papers introduce methods like Light…
RESEARCH · CL_10154 · Apr 30 · 04:00

OVGGT achieves constant-cost streaming for 3D geometry reconstruction

Researchers have introduced OVGGT, a novel framework designed for reconstructing 3D geometry from streaming video with constant memory and compute costs. This training-free approach addresses the limitations of previous…
RESEARCH · CL_10106 · Apr 30 · 04:00

Focus method enhances LLM attention efficiency without performance loss

Researchers have developed a new method called Focus, designed to improve the efficiency of attention mechanisms in large language models. Standard attention scales quadratically with sequence length, leading to high co…
RESEARCH · CL_06527 · Apr 28 · 04:00

New methods QFlash and ELSA boost Vision Transformer attention efficiency

Researchers have developed two new methods to improve the efficiency of attention mechanisms in vision transformers. QFlash focuses on enabling integer-only operations for FlashAttention, achieving significant speedups …
SIGNIFICANT · CL_05912 · Apr 27 · 22:38

Together AI powers national scientific mission with open-source infrastructure

Together, an open-source AI lab, has announced its participation in the Genesis Mission, a project aimed at doubling American scientific productivity over the next decade. The initiative connects supercomputers, experim…
TOOL · CL_47658 · Apr 1 · 00:00

Together AI kernels team optimizes GPUs with FlashAttention

The Together AI kernels team, including researchers Dan Fu and Tri Dao, developed FlashAttention, a software layer that significantly optimizes GPU performance for AI models. This breakthrough, achieved by applying data…
