Strix Halo users: a rejected PR can give you up to 30% faster prompt processing (PP) for MoEs.
A pull request for llama.cpp that was rejected by the main project speeds up prompt processing for Mixture of Experts (MoE) models on Strix Halo hardware. The modification, developed by pedapudi, can increase prompt processing speed by up to 30%, with the largest gains at lower context lengths. Because the code changes are small, users can apply them manually to their local llama.cpp builds to get these gains.
AI IMPACT: Manually applying a small code change can yield up to 30% faster prompt processing for MoE models on Strix Halo hardware.