PulseAugur

Ornith-1.0-35B GGUF model updated with speculative-decode graft

The GGUF release of the Ornith-1.0-35B model has been updated with a native Multi-Token Prediction (MTP) speculative-decode graft. The update speeds up single-stream decoding by 1.3-1.35x, reaching up to 233.8 tokens per second. The model keeps a low Kullback–Leibler divergence (KLD) of 0.073, which is lower (better) than that of the Q4_K_M quantization, and it performs better in long-context scenarios.
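For readers unfamiliar with the KLD figure, the sketch below shows how such a number is commonly computed. It assumes the usual setup: a mean, over token positions, of the KL divergence between a reference model's next-token distribution and the candidate model's. The summary does not say how the 0.073 was measured, so this is an illustration and not the author's method. The function names and the toy distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary.

    Terms with p_i == 0 contribute nothing. q_i must be positive wherever
    p_i is positive, otherwise the divergence is undefined (division by zero).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kld(reference_rows, candidate_rows):
    """Average KL divergence over token positions."""
    pairs = list(zip(reference_rows, candidate_rows))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)

# Hypothetical next-token distributions over a 3-token vocabulary,
# one row per token position (illustrative values, not measured data).
reference = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
candidate = [[0.6, 0.3, 0.1], [0.5, 0.25, 0.25]]

print(mean_kld(reference, candidate))
```

A value of 0 means the candidate reproduces the reference distributions exactly. Larger values mean the candidate's token probabilities drift further from the reference.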

IMPACT Enhances local LLM performance and efficiency for users running models on consumer hardware.

RANK_REASON Update to an existing open-source model with performance improvements and new features.

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/Blahblahblakha ·

    Ornith-1.0-35B GGUF update: native MTP speculative-decode graft + full serving/TTFT/long-context numbers (llama.cpp, tp=1)

    <table> <tr><td> <a href="https://www.reddit.com/r/LocalLLaMA/comments/1ui4yn6/ornith1035b_gguf_update_native_mtp/"> <img alt="Ornith-1.0-35B GGUF update: native MTP speculative-decode graft + full serving/TTFT/long-context numbers (llama.cpp, tp=1)" src="https://preview.redd.it/…