EvoTensile: Evolutionary algorithms for AMD Tensile GEMM kernel tuning
EvoTensile is a new tool for tuning AMD Tensile GEMM kernels, which are central to AI model training and inference. It uses evolutionary algorithms to search the kernel parameter space for the fastest configurations. On AMD's Strix Halo (gfx1151) hardware, EvoTensile-tuned NT layout kernels went from 20 to 40 TFLOPS, approaching the theoretical roofline. The developer hopes the tool will be integrated into mainstream ROCm libraries for broader adoption.
AI IMPACT: Optimized kernels can lead to faster AI model training and inference, potentially reducing computational costs and accelerating development cycles.