BeeLlama v0.3.1 – latest llama.cpp with extras! DFlash, MTP, q6_0 cache, TurboQuant. Single RTX 3090: Qwen 3.6 27B & Gemma 4 31B up to 177.8 tps (4.93x over baseline)
BeeLlama v0.3.1, a fork of llama.cpp, has been released with significant performance enhancements. This update integrates DFlash and Multi-Token Prediction (MTP), along with new quantization options such as q6_0 cache and TurboQuant. Benchmarks on a single RTX 3090 show Qwen 3.6 27B and Gemma 4 31B reaching up to 177.8 tokens per second (tps), a 4.93x speedup over the baseline.
AI IMPACT: Enhances local LLM inference speed and efficiency, enabling more powerful models on consumer hardware.
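As a quick sanity check on the headline figures, the baseline throughput implied by the reported peak and speedup can be worked out directly. This is a minimal sketch that assumes the 177.8 tps peak and the 4.93x speedup refer to the same model and configuration, which the summary does not state explicitly; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the headline numbers.
# Assumption: the peak and the speedup describe the same model/config.
peak_tps = 177.8  # reported peak throughput, tokens per second
speedup = 4.93    # reported speedup over the baseline

# speedup = peak / baseline, so baseline = peak / speedup
implied_baseline_tps = peak_tps / speedup

print(f"implied baseline: {implied_baseline_tps:.1f} tps")
```

Under that assumption the baseline works out to roughly 36 tps on the same RTX 3090.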