Apple M4 Max GPU's Tensor Compute Path Emulated, Not Accelerated

By PulseAugur Editorial · [1 source] · 2026-06-12 04:00

A new arXiv paper, "Rigel," reverse-engineers the Metal 4.1 tensor compute path on Apple's M4 Max GPU and reports that the fp8 matmul2d operation is emulated rather than hardware-accelerated: it runs on the GPU's shader cores, accumulates in at least fp32 precision, and uses neither a dedicated matrix datapath nor the Apple Neural Engine. The findings rest on empirical characterization and microbenchmarking. The work also led to a fused kernel that outperforms the decomposed path by up to 12.9%.
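To make the "accumulates in at least fp32 precision" claim concrete, here is a minimal CPU-side sketch of one common way to infer accumulator width from results alone. It is not the paper's method, which this summary does not describe. The helper names `round_to` and `accumulate_ones` are hypothetical, and the strictly sequential summation order is an assumption; real hardware may reduce in a different order, which changes where a narrow accumulator stalls.

```python
import struct

def round_to(fmt: str, x: float) -> float:
    """Round a Python float to the format named by a struct code
    ('e' = IEEE fp16, 'f' = IEEE fp32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def accumulate_ones(k: int, fmt: str) -> float:
    """Emulate a length-k dot product of all-ones vectors (1.0 is exactly
    representable in fp8), rounding the running sum to `fmt` after every
    add. Sequential order is an assumption of this sketch."""
    acc = 0.0
    for _ in range(k):
        acc = round_to(fmt, acc + 1.0 * 1.0)
    return acc

if __name__ == "__main__":
    k = 4096
    # An fp16 accumulator stalls at 2048, because 2049 is not representable
    # in fp16 and the tie rounds back to 2048.
    print("fp16 accumulator:", accumulate_ones(k, "e"))
    # An fp32 accumulator represents every integer up to 2**24 exactly.
    print("fp32 accumulator:", accumulate_ones(k, "f"))
```

If a device returns exactly k for such an input with k well above 2048, a sequential fp16 accumulator is ruled out; separating fp32 from anything wider would need k above 2**24 or a different probe.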

IMPACT Shows that the fp8 matmul2d operation on the M4 Max GPU is emulated on shader cores, which affects performance expectations for fp8 AI workloads on Apple hardware.

RANK_REASON Academic paper detailing empirical characterization of hardware behavior.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CL TIER_1 English(EN) · Ramchand Kumaresan · 2026-06-12 04:00

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

arXiv:2606.12765v1 Announce Type: new Abstract: Apple's Metal 4.1 exposes a tensor compute path: the Metal Performance Primitives (MPP) matmul2d operation over cooperative_tensor fragments, whose interface is documented but whose hardware behavior is deliberately hidden. The spec…
