Brief · PulseAugur

RESEARCH · arXiv cs.AI English(EN) · 4d · [2 sources]

CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference

Researchers propose Collocation-Length Prediction (CLP), a method for accelerating large language model inference. CLP targets a core problem in multi-token prediction (MTP): the head that predicts subsequent tokens interferes with the main language model head, which degrades output quality. CLP redesigns the architecture so that the main head always generates the first token and a lightweight CLP layer predicts the subsequent tokens, yielding speedups without sacrificing output quality. In experiments on Qwen2.5 models, the method reached speedups of up to 1.29x with negligible repetition.
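The division of labor described above can be illustrated with a toy decoding loop. This is a minimal sketch, not the paper's implementation: it assumes the lightweight layer returns zero or more follow-up tokens per step, and the names `adaptive_decode`, `toy_main_head`, and `toy_extra_head` are hypothetical stand-ins for real model components.

```python
from typing import Callable, List, Tuple


def adaptive_decode(
    main_head: Callable[[List[int]], int],
    extra_head: Callable[[List[int]], List[int]],
    prompt: List[int],
    max_new_tokens: int,
) -> Tuple[List[int], int]:
    """Return (generated tokens, number of decoding steps)."""
    context = list(prompt)
    generated: List[int] = []
    steps = 0
    while len(generated) < max_new_tokens:
        steps += 1
        # The main head always produces the first token of the step.
        first = main_head(context)
        context.append(first)
        generated.append(first)
        # Assumption: the lightweight layer proposes zero or more
        # follow-up tokens, truncated to the remaining token budget.
        room = max_new_tokens - len(generated)
        extras = extra_head(context)[:room]
        context.extend(extras)
        generated.extend(extras)
    return generated, steps


def toy_main_head(context: List[int]) -> int:
    # Hypothetical stand-in for the main LM head: count upward mod 10.
    return (context[-1] + 1) % 10


def toy_extra_head(context: List[int]) -> List[int]:
    # Hypothetical stand-in for the lightweight layer: emit one extra
    # token only when the last token is a multiple of 3.
    if context[-1] % 3 == 0:
        return [(context[-1] + 1) % 10]
    return []


tokens, steps = adaptive_decode(toy_main_head, toy_extra_head, [0], 6)
print(tokens, steps, len(tokens) / steps)
```

In this toy run the loop emits six tokens in five steps, so the average number of tokens per step exceeds one. Any real speedup depends on how often the lightweight layer can emit extra tokens and on how little it costs relative to a full forward pass.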

IMPACT Introduces a lightweight approach to accelerating LLM inference that could reduce computational cost and latency in real-time applications.

large language model
Qwen2.5
multi-token prediction
autoregressive decoding
language model head