Researchers have reverse-engineered the Metal 4.1 tensor compute path on Apple's M4 Max GPU and found that the fp8 matmul2d operation is emulated rather than hardware-accelerated. The operation runs on the GPU's shader cores, accumulates in at least fp32 precision, and uses neither a dedicated matrix datapath nor the Apple Neural Engine. The findings, detailed in a paper titled "Rigel," are based on empirical characterization and microbenchmarking, and they led the authors to develop a fused kernel that outperforms the decomposed path by up to 12.9%.
IMPACT Shows that a key fp8 tensor operation is emulated on Apple hardware, which tempers performance expectations for fp8 AI models on that GPU.
RANK_REASON Academic paper detailing empirical characterization of hardware behavior.
AI-generated summary · Google Gemini · from 1 source.