Brief · PulseAugur

TOOL · arXiv cs.CL English(EN) · 5h

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

Researchers have reverse-engineered the Metal 4.1 tensor compute path on Apple's M4 Max GPU and found that the fp8 matmul2d operation is emulated rather than hardware-accelerated. The operation runs on the GPU's shader cores, accumulates in at least fp32 precision, and uses neither a dedicated matrix datapath nor the Apple Neural Engine. The findings, detailed in the paper "Rigel," come from empirical characterization and microbenchmarking, which also led to a fused kernel that outperforms the decomposed path by up to 12.9%.
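To make "emulated" concrete, here is a minimal Python sketch of what a software fp8 matrix multiply with fp32 accumulation looks like in principle. It is an illustration under stated assumptions, not the Metal implementation or the paper's code: the brief does not say which fp8 encoding is used, so the sketch assumes the OCP E4M3 layout, and the function names (decode_e4m3, to_f32, emulated_fp8_matmul) are hypothetical.

```python
import struct


def decode_e4m3(byte):
    """Decode one fp8 byte to a float (ASSUMPTION: OCP E4M3, bias 7, no infinities)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")  # the only NaN pattern in E4M3
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def to_f32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def emulated_fp8_matmul(a, b):
    """Multiply an M x K by a K x N matrix of fp8 bytes, accumulating in fp32.

    Hypothetical helper: each input is widened in software, multiplied,
    and the running sum is rounded to fp32 after every add.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                prod = decode_e4m3(a[i][p]) * decode_e4m3(b[p][j])
                acc = to_f32(acc + prod)
            out[i][j] = acc
    return out


# Usage: [1.0, 2.0] times the column [1.5, 1.0]
result = emulated_fp8_matmul([[0x38, 0x40]], [[0x3C], [0x38]])
```

The point of the sketch is the shape of the work: every fp8 operand is decoded to a wider type before an ordinary multiply-add, so general-purpose arithmetic units do all of it and no dedicated fp8 matrix unit is involved.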

IMPACT Shows that fp8 matmul2d on the M4 Max is emulated on shader cores rather than hardware-accelerated, which tempers performance expectations for fp8 AI workloads on Apple hardware.

Apple
fp16
GELU
GEMM
fp8
Apple Neural Engine
Rigel
Metal 4.1
Ramchand Kumaresan
M4 Max GPU
matmul2d