PulseAugur
research · [3 sources] ·

New methods boost LLM inference speed with hybrid tree construction and flexible decoding

Researchers have proposed two new speculative decoding methods for accelerating large language model inference. The first, Graft, combines pruning and retrieval, using retrieval to fill the gaps left by pruned branches of the draft tree; it reports up to a 5.41x speedup on short-context benchmarks and improvements over existing methods on larger models. The second, FlexDraft, uses attention tuning and bonus-guided calibration to adapt to varying batch sizes, mitigating draft-verification mismatch and eliminating redundant computation.

Summary written by gemini-2.5-flash-lite from 3 sources.

IMPACT These techniques could significantly reduce the computational cost and latency of running large language models, making them more accessible and efficient for real-world applications.

RANK_REASON Two academic papers introducing novel methods for speculative decoding in LLMs.

COVERAGE [3]

  1. arXiv cs.AI TIER_1 · Cong Wang

    Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

    Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottl…

  2. arXiv cs.CL TIER_1 · Linfeng Zhang

    FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

    Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting …

  3. Hugging Face Daily Papers TIER_1

    FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

    Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting …
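
The draft-then-verify loop that both abstracts describe can be illustrated with a minimal sketch. This is generic greedy speculative decoding with a single draft chain, not the Graft or FlexDraft method; the toy models, the function names, and the parameter k are all assumptions made for illustration. A real system would score all k draft positions in one batched forward pass of the target model, whereas this sketch calls the target once per position for clarity.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: a model is a function from context to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def greedy_decode(target: NextToken, prompt: List[Token], num_new: int) -> List[Token]:
    """Baseline: plain autoregressive greedy decoding with the target model."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, drafter: NextToken,
                       prompt: List[Token], num_new: int, k: int = 4) -> List[Token]:
    """Greedy draft-then-verify: the output matches greedy_decode(target, ...)."""
    out = list(prompt)
    goal = len(prompt) + num_new
    while len(out) < goal:
        # Draft phase: the cheap drafter proposes k tokens autoregressively.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(drafter(out + draft))

        # Verify phase: keep the longest prefix of the draft the target agrees with.
        accepted: List[Token] = []
        for tok in draft:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch with the target's token
                break
        else:
            # Every draft token was accepted, so the target contributes one extra token.
            accepted.append(target(out + accepted))

        out.extend(accepted)
    return out[:goal]


def toy_target(ctx: List[Token]) -> Token:
    """Toy stand-in for the large model (assumption for the demo)."""
    return (ctx[-1] * 3 + 1) % 11


def toy_drafter(ctx: List[Token]) -> Token:
    """Toy stand-in for the drafter: agrees with the target except at some positions."""
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 11


if __name__ == "__main__":
    prompt = [1, 2]
    print(speculative_decode(toy_target, toy_drafter, prompt, num_new=10))
    print(greedy_decode(toy_target, prompt, num_new=10))
```

Because every emitted token is either confirmed or supplied by the target, the two printed sequences are identical; the drafter only affects how many tokens are committed per verification round, which is where the speedup in a real system would come from.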