
RESEARCH · arXiv cs.CL English(EN) · 12mo · [8 sources]

FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

Two new research papers, Graft and FlexDraft, introduce speculative decoding techniques to accelerate large language model inference. Graft combines pruning with retrieval to fill the gaps left by pruned branches, achieving significant speedups without training. FlexDraft uses attention tuning and bonus-guided calibration to adapt across batch sizes, mitigating mismatches between drafting and verification and improving throughput. Both methods target the latency-cost trap in LLM deployment: they aim to deliver the response quality of a large model at speeds closer to those of a smaller one.

IMPACT These advancements in speculative decoding could significantly reduce LLM inference latency and cost, enabling faster and more efficient deployment of AI applications.
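
For readers unfamiliar with the draft-and-verify loop that both papers build on, here is a minimal sketch of generic greedy speculative decoding. It is not the Graft or FlexDraft algorithm. The toy "models" (`toy_target`, `toy_draft`) and the function names `speculative_step` and `generate` are hypothetical and chosen only for illustration. Real systems verify all drafted tokens in one batched forward pass of the large model, whereas this sketch calls the target once per position for clarity.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Run one draft-and-verify round and return the tokens to append."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    drafted: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. The target model checks each drafted token in order.
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in drafted:
        expected = target_next(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft was accepted, so the target contributes one extra
    #    token (often called the "bonus" token) at no additional round.
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix: List[Token], draft_next: NextToken,
             target_next: NextToken, k: int, max_new: int) -> List[Token]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prefix) + max_new]


# Hypothetical toy models: the target counts upward modulo 10, and the
# draft agrees with it except right after the token 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(generate([0], toy_draft, toy_target, k=3, max_new=8))
```

With greedy verification the output is identical to what the target model would produce alone; the draft model affects only how many target rounds are needed. In this toy run, the first round accepts three drafts plus a bonus token, while the round starting after token 4 accepts none and falls back to the target's correction.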

Qwen3-235B
Graft
FlexDraft
Speculative Decoding
vLLM
Claude Sonnet
Llama-3-8B
Llama-3-70B
GPT-4
Ollama