Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
A new paper analyzes why training transformer models in low-precision formats with Flash Attention can cause training instability and loss explosions. The research identifies two key factors: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors in low-precision arithmetic. Together, these create a cycle of error accumulation that corrupts weight updates. The authors propose a minor modification to Flash Attention that mitigates the rounding bias; it stabilizes training, which in turn supports their analysis.
IMPACT: Provides a mechanistic explanation for low-precision training failures with Flash Attention, offering a practical solution to improve stability.
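To make the idea of compounding, biased rounding error concrete, here is a minimal sketch in plain Python. It is not the paper's analysis or its Flash Attention fix. It only emulates bfloat16 rounding (round-to-nearest-even) and shows how errors that always point the same way add up instead of cancelling. The helper names `round_to_bf16` and `accumulate`, and the chosen values, are assumptions made for illustration.

```python
import struct


def round_to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    Illustrative emulation only: x is first converted to float32, then the
    low 16 bits are rounded away. NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    # Add 0x7FFF plus the lowest kept bit so that exact ties round to even.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFFFFFF
    bits &= 0xFFFF0000  # keep only the 16 bits that bfloat16 stores
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(start: float, step: float, n: int) -> tuple[float, float]:
    """Add `step` to `start` n times, once with a bfloat16 accumulator and
    once with a float64 reference accumulator. Returns (bf16_total, reference).
    """
    step = round_to_bf16(step)   # make the increment representable in bfloat16
    low = round_to_bf16(start)   # accumulator rounded after every addition
    ref = low                    # float64 reference, no intermediate rounding
    for _ in range(n):
        low = round_to_bf16(low + step)
        ref += step
    return low, ref


if __name__ == "__main__":
    low, ref = accumulate(1.0, 0.001, 1000)
    print(f"bfloat16 accumulator: {low}")
    print(f"float64 reference:    {ref}")
    print(f"accumulated error:    {ref - low}")
```

With these values every addition is rounded in the same direction: the increment is smaller than half the spacing between adjacent bfloat16 values near 1.0 (that spacing is 2^-7), so the low-precision accumulator never leaves 1.0 while the reference keeps growing. The error therefore grows linearly with the number of steps, which is the generic behavior that a systematic rounding bias produces.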