New MRP technique boosts language model speed and accuracy

By PulseAugur Editorial · [1 source] · 2026-07-01 00:00

Researchers from Modal Research and NYU Shanghai's HeavyBall Research have developed Multi-Token Residual Prediction (MRP), a technique that improves the speed and accuracy of diffusion language models. MRP trains a small module to predict the residual difference between adjacent denoising steps, rather than the full distribution. In a static regime, this allows faster decoding with minimal quality loss, reaching up to 1.56x throughput. In a dynamic regime, it recovers a significant share of the accuracy points lost under aggressive low-threshold decoding settings.
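To make the residual idea concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the linear head, the ridge-regression fit, the synthetic data, and every name in it (fit_residual_head, predict_next_logits, features, logits_prev, logits_next) are assumptions made for illustration. It only shows the general pattern of predicting the difference between two adjacent steps and adding it back to the previous step's output, instead of predicting the next step's full output from scratch.

```python
import numpy as np


def fit_residual_head(features, logits_prev, logits_next, ridge=1e-3):
    """Fit a linear map W so that features @ W approximates the residual
    (logits_next - logits_prev). Hypothetical stand-in for a small trained module."""
    residual = logits_next - logits_prev
    hidden = features.shape[1]
    gram = features.T @ features + ridge * np.eye(hidden)
    return np.linalg.solve(gram, features.T @ residual)


def predict_next_logits(features, logits_prev, weights):
    """Approximate the next step as the previous logits plus a predicted residual."""
    return logits_prev + features @ weights


# Synthetic toy data (assumption): adjacent steps differ by a small residual
# that is a linear function of per-position features.
rng = np.random.default_rng(0)
n_positions, hidden, vocab = 256, 16, 32
features = rng.normal(size=(n_positions, hidden))
logits_prev = rng.normal(size=(n_positions, vocab))
true_map = 0.1 * rng.normal(size=(hidden, vocab))
logits_next = logits_prev + features @ true_map

weights = fit_residual_head(features, logits_prev, logits_next)
approx_next = predict_next_logits(features, logits_prev, weights)

# Compare against simply reusing the previous step's logits unchanged.
baseline_error = np.abs(logits_next - logits_prev).mean()
residual_error = np.abs(logits_next - approx_next).mean()
print(f"reuse previous step: {baseline_error:.4f}")
print(f"previous + residual: {residual_error:.4f}")
```

In this toy setup the residual is linear in the features by construction, so the fit is nearly exact; a real diffusion language model would need a learned module and real denoising trajectories.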

IMPACT This research could lead to faster and more accurate language model inference, benefiting applications that rely on real-time text generation.

RANK_REASON The item describes a new research method for improving language model inference speed and accuracy, including a paper and code release.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

Modal blog · English (EN) · 2026-07-01 00:00

Multi-token Residual Prediction

One tiny module, two ways to win on Diffusion LMs
