Two new research papers explore the intricacies of speculative decoding, a technique for speeding up inference in large language models. The first identifies a phenomenon called "attention drift," in which the model's attention shifts from the prompt to its own generated tokens, and proposes architectural changes to mitigate it. The second addresses grammar-faithful speculative decoding: it shows that current methods sample from an unintended distribution and introduces a "future-validity" statistic to correct this, demonstrating improvements on specific grammar types.
Summary written by gemini-2.5-flash-lite from 3 sources.
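For readers unfamiliar with the base technique both papers build on, below is a minimal sketch of the standard speculative decoding accept/reject loop (per Leviathan et al., 2023), not the method of either paper. Toy fixed categorical distributions stand in for per-position draft- and target-model outputs, and all names here (`speculative_step`, `draft_probs`, `target_probs`) are illustrative.

```python
# Minimal sketch of the standard speculative decoding accept/reject rule.
# Simplification: real systems recompute draft/target probabilities per
# position from model forward passes; here both are fixed toy distributions.
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(draft_probs, target_probs, k=4):
    """Propose up to k tokens from the draft distribution; accept each with
    probability min(1, p_target / p_draft). On the first rejection, resample
    once from the residual max(0, p - q), renormalized, and stop."""
    accepted = []
    for _ in range(k):
        tok = rng.choice(len(draft_probs), p=draft_probs)
        q, p = draft_probs[tok], target_probs[tok]
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)  # kept: output matches the target distribution
        else:
            residual = np.maximum(target_probs - draft_probs, 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(len(residual), p=residual))
            break  # stop speculating after a rejection
    return accepted

# Toy 5-token vocabulary; in practice these come from model logits.
draft = np.array([0.5, 0.2, 0.1, 0.1, 0.1])
target = np.array([0.3, 0.3, 0.2, 0.1, 0.1])
print(speculative_step(draft, target))
```

The second paper's concern, as summarized above, is that naively applying a grammar mask to this accept/reject rule biases the sampled distribution, which its "future-validity" statistic is meant to correct.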
IMPACT These papers introduce methods to improve the accuracy and efficiency of speculative decoding, potentially leading to faster and more reliable LLM inference, particularly for grammar-constrained generation.
RANK_REASON Two academic papers published on arXiv introduce novel findings and techniques related to speculative decoding in LLMs.