Brief · PulseAugur

RESEARCH · arXiv cs.CL English(EN) · 13mo · [60 sources]

FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

Researchers have developed several new methods to accelerate large language model (LLM) inference through speculative decoding. AdaPLD improves retrieval and draft construction by using semantic similarity and branched hypotheses, achieving up to 3.10x speedup. SSSD combines n-gram matching with hardware-aware speculation for up to 2.9x latency reduction without training. D^2SD uses a dual diffusion model and confidence-guided prefix trees to enhance acceptance rates, while TAPS optimizes prefix tree selection for diffusion-drafted decoding, yielding up to 7.9x speedup. KnapSpec treats draft model selection as a knapsack problem to maximize throughput, achieving up to 1.47x speedup, and Vegas uses verification-guided sparse attention for improved decoding throughput. Additionally, LK Losses directly optimize the acceptance rate during training, leading to gains of 8-10% in average acceptance length. AI

IMPACT These advancements in speculative decoding promise significant speedups and efficiency gains for LLM inference, potentially lowering costs and increasing accessibility.