Brief · PulseAugur

TOOL · r/LocalLLaMA English (EN) · 2w

VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do?

A user on r/LocalLLaMA wants to combine vLLM's inference speed with Unsloth's quantized models. They report 5k-10k tokens/sec with vLLM, compared with 800-1000 tokens/sec with standard Llama implementations. However, loading Unsloth's quantized models, specifically those in GGUF format, into vLLM fails with compatibility errors.
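As a rough check on the "5x" in the title, the speedup implied by the reported throughput ranges can be worked out directly. This is a minimal sketch using only the figures quoted above; the helper name `speedup_range` and the constant names are hypothetical, chosen for illustration, and nothing here measures a real model.

```python
# Throughput figures quoted in the summary, in tokens per second (low, high).
VLLM_TOKENS_PER_SEC = (5_000, 10_000)
LLAMA_TOKENS_PER_SEC = (800, 1_000)


def speedup_range(fast, slow):
    """Return (min, max) speedup of `fast` over `slow`, given (low, high) ranges."""
    fast_low, fast_high = fast
    slow_low, slow_high = slow
    # Smallest speedup: slowest "fast" run against the quickest "slow" run.
    # Largest speedup: quickest "fast" run against the slowest "slow" run.
    return fast_low / slow_high, fast_high / slow_low


low, high = speedup_range(VLLM_TOKENS_PER_SEC, LLAMA_TOKENS_PER_SEC)
print(f"vLLM speedup: {low:.1f}x to {high:.1f}x")  # vLLM speedup: 5.0x to 12.5x
```

The lower bound matches the 5x figure in the title; the upper bound only applies if the best vLLM run is compared against the slowest baseline run.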

IMPACT Users may find ways to optimize local LLM performance by combining different inference and quantization techniques.

Qwen
Llama
Unsloth
GGUF
vLLM
RTX A6000