Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU
Researchers have reverse-engineered the Metal 4.1 tensor compute path on Apple's M4 Max GPU and report that the fp8 matmul2d operation is emulated rather than hardware-accelerated: it runs on the GPU's shader cores, accumulates in at least fp32 precision, and uses neither a dedicated matrix datapath nor the Apple Neural Engine. The findings, detailed in a paper titled "Rigel," come from empirical characterization and microbenchmarking, which also led to a fused kernel that outperforms the decomposed path by up to 12.9%.
IMPACT: Key tensor operations on Apple hardware are emulated rather than hardware-accelerated, which affects performance expectations for AI models.
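The claim that accumulation happens in at least fp32 is the kind of property a numerical probe can check. The sketch below is not the paper's method; it is a minimal CPU-side illustration, assuming a sequential accumulator that rounds after every add. It sums 1.0 followed by 4096 terms of 2^-12, which stand in for products of low-precision operands. A half-precision accumulator would drop every small term, while an fp32 or wider one would keep them all. The function names (`accumulate`, `probe_inputs`, `classify_accumulator`) are hypothetical, and a real GPU may reduce in a different order, which would change the fp16 prediction.

```python
import struct


def round_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(values, round_fn):
    """Sum values in order, rounding the running total after every add.

    This models a sequential accumulator of a given width (an assumption;
    real hardware may use a tree reduction instead).
    """
    total = 0.0
    for v in values:
        total = round_fn(total + v)
    return total


def probe_inputs(n_small: int = 4096):
    """1.0 followed by n_small terms of 2**-12.

    At 1.0 the fp16 spacing is 2**-10, so each 2**-12 term is below half
    a step and is rounded away; in fp32 every partial sum is exact.
    """
    return [1.0] + [2.0 ** -12] * n_small


def classify_accumulator(observed: float, values) -> str:
    """Label an observed sum by the closest modeled accumulator width."""
    models = {
        "fp16": accumulate(values, round_fp16),
        "fp32 or wider": accumulate(values, round_fp32),
    }
    return min(models, key=lambda name: abs(models[name] - observed))


if __name__ == "__main__":
    values = probe_inputs()
    print("fp16 model:", accumulate(values, round_fp16))
    print("fp32 model:", accumulate(values, round_fp32))
```

In use, the observed value would come from running the same dot product through the operation under test and passing the result to `classify_accumulator`. The probe cannot tell fp32 from a wider accumulator, which is consistent with the "at least fp32" wording of the finding.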