Approaching I/O-optimality for Approximate Attention
Researchers have developed a technique that significantly reduces the I/O complexity of attention in large language models. The method minimizes data transfers between fast and slow memory, a critical factor in the efficiency of these models. Inspired by recent approximate attention frameworks, it achieves an almost-linear I/O cost in the input size, a substantial improvement over the quadratic cost of existing approaches.
AI IMPACT Reduces I/O overhead for attention, potentially enabling larger models or faster inference.
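
To make the quadratic-versus-linear gap concrete, here is a minimal Python sketch under a simplified two-level memory model. It is an illustration only, not the method from the work summarized above: the function names and the cost model (a standard attention pass that writes the full n x n score matrix to slow memory and reads it back) are assumptions chosen for this example.

```python
def input_size(n: int, d: int) -> int:
    """Number of elements in Q, K and V (each n x d)."""
    return 3 * n * d


def naive_attention_io(n: int, d: int) -> int:
    """Elements moved between slow and fast memory by a standard attention
    pass that materializes the n x n score matrix in slow memory.

    Assumed cost model (hypothetical, for illustration):
      - read Q and K, write scores S
      - read S, write probabilities P
      - read P and V, write the output
    """
    read_qk = 2 * n * d
    write_scores = n * n
    read_scores = n * n
    write_probs = n * n
    read_probs_and_v = n * n + n * d
    write_out = n * d
    return (read_qk + write_scores + read_scores
            + write_probs + read_probs_and_v + write_out)


def io_blowup(n: int, d: int) -> float:
    """Ratio of naive I/O to the input size; grows linearly with n."""
    return naive_attention_io(n, d) / input_size(n, d)


if __name__ == "__main__":
    d = 64  # assumed head dimension
    for n in (1024, 4096, 16384):
        print(f"n={n:6d}  io={naive_attention_io(n, d):12d}  "
              f"io/input={io_blowup(n, d):8.1f}")
```

Under this model the total is 4n² + 4nd elements, so the ratio to the input size keeps growing with sequence length. An almost-linear I/O cost would keep that ratio close to flat as n grows.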