Brief · PulseAugur

TOOL · r/LocalLLaMA English(EN) · 3h

sycl : port multi-column MMVQ from CUDA backend (~45% speculative decoding speedup on Intel Arc) by masonmilby · Pull Request #21845 · ggml-org/llama.cpp

A pull request submitted to the llama.cpp project ports the multi-column MMVQ kernel (mul_mat_vec_q, llama.cpp's quantized matrix-vector multiplication) from the CUDA backend to the SYCL backend. The port targets Intel Arc graphics cards, and the pull request title reports a speculative decoding speedup of roughly 45%. Users with compatible Intel hardware can pick up the optimization by updating llama.cpp once the change is merged.

IMPACT Speeds up local LLM inference on Intel Arc GPUs, making local inference more practical on Intel hardware.

llama.cpp
CUDA
Intel Arc
masonmilby