New research enhances diffusion language model efficiency and scalability
作者PulseAugur 编辑部·[11 个来源]·
Researchers are exploring new methods to improve the efficiency and scalability of diffusion language models (DLMs) for generating long sequences of text. One approach, Block Approximate Sparse Attention (BA-Att), accelerates attention computation by downsampling the attention space, achieving significant speedups while maintaining near full-attention performance. Another development, Dynamic Chunking Diffusion Models (DCDM), replaces fixed positional blocks with content-defined semantic chunks to better capture sequence structure. Additionally, advancements in continuous diffusion models, like RePlaid, demonstrate competitive performance against discrete DLMs, suggesting they are a viable and scalable alternative.
AI
影响
New techniques promise faster and more scalable text generation from diffusion models, potentially enabling longer and more coherent outputs.
排序理由
Multiple arXiv papers detailing novel methods for improving diffusion language models.
arXiv:2602.08404v2 Announce Type: replace Abstract: Diffusion large language models (dLLMs) have recently gained significant attention due to their inherent support for parallel decoding. Building on this paradigm, Mixture-of-Experts (MoE) dLLMs with autoregressive (AR) initializ…
arXiv cs.CL
TIER_1English(EN)·Shubham Parashar, Atharv Chagi, Jacob Helwig, Lakshmi Jotsna, Sushil Vemuri, James Caverlee, Dileep Kalathil, Shuiwang Ji·
arXiv:2605.22939v1 Announce Type: new Abstract: We aim to improve the reasoning capabilities of diffusion language models (DLMs). While SFT is a popular post-training recipe for autoregressive models, its use in DLMs faces challenges and can even hurt performance, though the unde…
arXiv:2603.16077v3 Announce Type: replace Abstract: Masked diffusion models (MDM) exhibit superior generalization when learned using a Partial masking scheme (Prime). This approach converts tokens into sub-tokens and models the diffusion process at the sub-token level. We identif…
We aim to improve the reasoning capabilities of diffusion language models (DLMs). While SFT is a popular post-training recipe for autoregressive models, its use in DLMs faces challenges and can even hurt performance, though the underlying causes remain understudied. Our analysis …
Inference in diffusion large language models (dLLMs) is computationally expensive, as full self-attention must be repeatedly executed at each step of the denoising process without KV cache. Recent sparse attention methods for dLLMs mitigate this cost via block-sparse computation,…
Diffusion Language Models (DLMs) enable globally coherent, bidirectional, and controllable text generation, offering advantages over traditional autoregressive LLMs, while scaling to ultra-long sequences remains costly. Many existing block-sparse attention methods select blocks b…
Discrete diffusion language models (DDLMs) generate text by iteratively denoising categorical token sequences, while recent drifting methods for continuous generators suggest that part of this sampling-time correction can instead be absorbed into training through an anti-symmetri…
Block discrete diffusion language models factorize a sequence autoregressively over fixed-size positional blocks, decoupling within-block parallel denoising from across-block conditioning. We argue that this rigid partition wastes structure already present in the sequence: blocks…
Diffusion Language Models (DLMs) enable globally coherent, bidirectional, and controllable text generation, offering advantages over traditional autoregressive LLMs, while scaling to ultra-long sequences remains costly. Many existing block-sparse attention methods select blocks b…
arXiv:2605.18530v1 Announce Type: cross Abstract: While diffusion has drawn considerable recent attention from the language modeling community, continuous diffusion has appeared less scalable than discrete approaches. To challenge this belief we revisit Plaid, a likelihood-based …
While diffusion has drawn considerable recent attention from the language modeling community, continuous diffusion has appeared less scalable than discrete approaches. To challenge this belief we revisit Plaid, a likelihood-based continuous diffusion language model (DLM), and con…