P-Cast Precision in FP8 Attention: Sink-Induced Collapse and the Optimality of S=2^8
A new research paper analyzes precision challenges in FP8 attention computations, specifically focusing on the softmax probability matrix (P) when cast to FP8. The study identifies an issue called "P-collapse" that occurs with forward KV iteration, leading to underflow of non-sink probability values. Researchers propose a solution involving reverse KV iteration combined with a static scaling factor of S=256 (2^8) to eliminate this underflow and improve output precision. AI
IMPACT This research offers quantitative insights into optimizing FP8 precision for attention mechanisms, potentially improving efficiency in large model training and inference.