Modal has released a suite of new speculative decoding models for the Qwen series, aimed at accelerating LLM inference. Developed in collaboration with z-Labor and integrated with SGLang, the models offer an additional 5-20% speedup over existing DFlash speculators. With them, models such as Qwen 3.5 122B-A10B can exceed 1000 tokens/sec on high-end hardware while preserving performance on long-context tasks. Modal presents speculative decoding as a critical optimization for LLM inference, one that can deliver substantial speedups relative to traditional kernel optimizations.
AI IMPACT Accelerates LLM inference speed, potentially enabling more interactive and efficient AI applications.
RANK_REASON The item details a new technique (speculative decoding) and its application to specific models (Qwen series), along with performance improvements, which falls under research and infrastructure optimization for LLMs.
- Hugging Face
- LLM Engineer’s Almanac
- Modal
- Nvidia B200
- Qwen 3.5 122B-A10B
- Qwen 3.5 122B-A10B-DFlash
- Qwen 3.5 27B-DFlash
- Qwen 3.5 35B-A3B-DFlash
- Qwen 3.5 397B-A17B
- Qwen 3.5 4B-DFlash
- Qwen 3.5 9B-DFlash
- Qwen 3.6 35B-A3B-DFlash
- SGLang
- vLLM
- z-Labor
AI-generated summary · Google Gemini · from 1 source.