Researchers have developed two new methods to speed up large language model inference. One approach, Graft, combines pruning and retrieval to fill gaps left by pruned branches, achieving up to 5.41x speedup on short-context benchmarks and improving over existing methods on larger models. The other method, FlexDraft, uses attention tuning and bonus-guided calibration to adapt to varying batch sizes, mitigating draft-verification mismatch and eliminating redundant computation.
Summary written by gemini-2.5-flash-lite from 3 sources.
IMPACT These techniques could significantly reduce the computational cost and latency of running large language models, making them more accessible for real-world applications.
RANK_REASON Two academic papers introducing novel methods for speculative decoding in LLMs.