llama.cpp pull request ports multi-column MMVQ to the SYCL backend, speeding up Intel Arc GPUs

By PulseAugur Editorial · [1 source] · 2026-06-05 18:51

A pull request (#21845, by masonmilby) has been submitted to the llama.cpp project to port the multi-column MMVQ kernel from the CUDA backend to the SYCL backend. MMVQ is llama.cpp's name for its mul_mat_vec_q path, which multiplies a quantized weight matrix by activation vectors. The port targets Intel Arc graphics cards, and the pull request title reports a speculative decoding speedup of approximately 45%. The change is a pull request, so users with compatible Intel hardware will benefit only from a llama.cpp build that includes it, for example after it is merged.
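To illustrate why a multi-column kernel can help speculative decoding, here is a minimal sketch in plain Python. It is not the pull request's SYCL kernel. It assumes a simplified block format (32 integer codes plus one scale per block, loosely modelled on Q8_0), and every function name is hypothetical. The idea it shows: when a few draft tokens are verified together, the same quantized weight blocks are needed for several activation columns, so reading each block once and reusing it across columns cuts the number of weight reads.

```python
BLOCK = 32  # values per quantization block (assumption for this sketch)

def quantize_row(row):
    """Split a row into blocks of BLOCK values; return a list of (scale, codes)."""
    blocks = []
    for i in range(0, len(row), BLOCK):
        chunk = row[i:i + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0
        codes = [round(v / scale) for v in chunk]  # integer codes in [-127, 127]
        blocks.append((scale, codes))
    return blocks

def matvec_per_column(qweights, columns):
    """Baseline: one full pass over the quantized weights per activation column."""
    block_reads = 0
    out = []
    for x in columns:
        y = []
        for qrow in qweights:
            acc = 0.0
            for b, (scale, codes) in enumerate(qrow):
                block_reads += 1  # weight block fetched again for every column
                xs = x[b * BLOCK:(b + 1) * BLOCK]
                acc += scale * sum(c * v for c, v in zip(codes, xs))
            y.append(acc)
        out.append(y)
    return out, block_reads

def matvec_multi_column(qweights, columns):
    """Multi-column: each weight block is fetched once and applied to all columns."""
    block_reads = 0
    out = [[0.0] * len(qweights) for _ in columns]
    for r, qrow in enumerate(qweights):
        for b, (scale, codes) in enumerate(qrow):
            block_reads += 1  # single fetch, reused for every column below
            for j, x in enumerate(columns):
                xs = x[b * BLOCK:(b + 1) * BLOCK]
                out[j][r] += scale * sum(c * v for c, v in zip(codes, xs))
    return out, block_reads
```

Both functions return the same products; only the count of weight-block reads differs, by a factor equal to the number of columns. On a GPU the saving shows up as reduced memory traffic, and how much of that turns into end-to-end speedup depends on the hardware and model.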

IMPACT Speeds up local LLM inference on Intel Arc GPUs, in particular speculative decoding, making Intel hardware a more practical option for running models locally.

RANK_REASON This is a code contribution to an open-source project that improves performance on specific hardware (Intel Arc GPUs) for a specific user group.

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

r/LocalLLaMA TIER_1 English(EN) · /u/pmttyji · 2026-06-05 18:51

sycl : port multi-column MMVQ from CUDA backend (~45% speculative decoding speedup on Intel Arc) by masonmilby · Pull Request #21845 · ggml-org/llama.cpp

Source: https://www.reddit.com/r/LocalLLaMA/comments/1txtuzk/sycl_port_multicolumn_mmvq_from_cuda_backend_45/
