PulseAugur

BeeLlama v0.3.1 boosts local LLM performance with DFlash, MTP

BeeLlama v0.3.1, a fork of llama.cpp, has been released with significant performance enhancements. The update includes DFlash and Multi-Token Prediction (MTP), along with new quantization options such as a q6_0 cache and TurboQuant. Benchmarks on a single RTX 3090 report Qwen 3.6 27B and Gemma 4 31B reaching up to 177.8 tokens per second (tps), a 4.93x improvement over the baseline.
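For a sense of scale, the quoted figures imply a baseline of roughly 36 tps. The sketch below shows that arithmetic, assuming the 4.93x figure is a plain ratio of tokens per second on the same model and hardware (the post excerpt does not spell out the methodology). The function name `implied_baseline_tps` is hypothetical and only illustrates the calculation.

```python
def implied_baseline_tps(boosted_tps: float, speedup: float) -> float:
    """Return the baseline throughput implied by a boosted rate and a speedup factor.

    Assumes speedup = boosted_tps / baseline_tps (an assumption, not stated in the source).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return boosted_tps / speedup


# Figures quoted in the summary: up to 177.8 tps at a 4.93x speedup.
baseline = implied_baseline_tps(177.8, 4.93)
print(f"Implied baseline: {baseline:.1f} tps")
```

Because "up to" describes a best case, the implied baseline applies only to that peak configuration, not to every model and setting in the post.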

IMPACT Enhances local LLM inference speed and efficiency, enabling more powerful models on consumer hardware.

RANK_REASON This is a software update/fork of an existing project (llama.cpp) with performance improvements and new features, not a novel model release or foundational research.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/Anbeeld

    BeeLlama v0.3.1 – latest llama.cpp with extras! DFlash, MTP, q6_0 cache, TurboQuant. Single RTX 3090: Qwen 3.6 27B & Gemma 4 31B up to 177.8 tps (4.93x over baseline)

    BeeLlama v0.3.0 and v0.3.1 are here! Big architectural update to align the fork with upstream llama.cpp and integrate all its additions like MTP and Gemma 4 12B support, while also updating DFlash to handle complex configurations…