Diffusion Transformers advance image generation and material transfer

By PulseAugur Editorial · [11 sources] · 2026-05-15 06:31

Several recent arXiv papers propose improvements to Diffusion Transformer (DiT) architectures for image generation and manipulation. One studies register tokens in pixel-space DiTs and reports that they improve convergence and generation quality while producing cleaner feature maps. HyperDiT uses hyper-connected cross-scale interactions and registers to bridge semantic and pixel manifolds for high-fidelity generation. ElasticDiT targets efficiency on mobile devices by dynamically adjusting its architecture and using sparse attention, while DreamSR enhances super-resolution by combining global and local textual features. Finally, DealMaTe and MaTe simplify material transfer by eliminating text guidance and relying on image inputs within DiT frameworks.

IMPACT These advancements in Diffusion Transformers offer improved image generation fidelity, efficiency for mobile devices, and new capabilities in super-resolution and material transfer.

RANK_REASON Multiple research papers published on arXiv detailing new architectures and techniques for Diffusion Transformers.

AI-generated summary · Google Gemini · from 11 sources.

COVERAGE [11]

arXiv cs.CV TIER_1 English(EN) · Yunhai Tong · 2026-05-20 17:59

One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration

Discrete diffusion models excel at visual synthesis but rely on slow, iterative decoding. Existing single-step distillation methods attempt to bypass this bottleneck, either by training auxiliary score networks that effectively double compute, or by introducing specialized parame…

arXiv stat.ML TIER_1 English(EN) · Lifu Wei, Yinuo Ren, Naichen Shi, Yiping Lu · 2026-05-19 04:00

SURGE: Approximation-free Training Free Particle Filter for Diffusion Surrogate

arXiv:2605.18745v1. Diffusion-based generative models increasingly rely on inference-time guidance, adding a drift term or reweighting mixture of experts, to improve sample quality on task-specific objectives. However, most existing techniques require …

arXiv stat.ML TIER_1 English(EN) · Yiping Lu · 2026-05-18 17:59

SURGE: Approximation-free Training Free Particle Filter for Diffusion Surrogate

Diffusion-based generative models increasingly rely on inference-time guidance, adding a drift term or reweighting mixture of experts, to improve sample quality on task-specific objectives. However, most existing techniques require repeated score or gradient evaluations, introduc…

arXiv cs.CV TIER_1 English(EN) · Haohuan Fu · 2026-05-18 07:35

Learning to Balance: Decoupled Siamese Diffusion Transformer for Reference-Based Remote Sensing Image Super-Resolution

Diffusion-based methods demonstrate significant potential for remote sensing image super-resolution at large scaling factors, particularly in reference-based super-resolution (RefSR) where high-resolution reference images provide critical fine-grained texture priors. However, exi…

arXiv cs.CV TIER_1 English(EN) · Yan Li · 2026-05-18 02:25

FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion

To circumvent the inherent fidelity bottlenecks and optimization misalignment of VAE-based latent diffusion, pixel-space diffusion models have emerged as a compelling end-to-end paradigm. However, existing pixel diffusion models often struggle to balance computational efficiency …

arXiv cs.CV TIER_1 English(EN) · Dmitry Baranchuk · 2026-05-15 16:27

Registers Matter for Pixel-Space Diffusion Transformers

Vision Transformers (ViTs) are known to exhibit high-norm patch-token outliers that degrade feature map quality, a problem effectively mitigated by register tokens. As diffusion models increasingly adopt transformer architectures and move toward pixel-space training, the…

arXiv cs.CV TIER_1 English(EN) · Yan Li · 2026-05-15 08:51

HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

Pixel-space diffusion models bypass the reconstruction bottleneck of Variational Autoencoders (VAEs) but face a fundamental "granularity dilemma": capturing global semantics favors large patch scales, while generating high-fidelity details demands fine-grained inputs. To address …

arXiv cs.CV TIER_1 English(EN) · Xinghao Chen · 2026-05-15 07:13

ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

The Diffusion Transformer (DiT) architecture is the state-of-the-art paradigm for high-fidelity image generation, underpinning models like Stable Diffusion-3 and FLUX.1. However, deploying these models on resource-constrained mobile devices entails prohibitive computational and m…

arXiv cs.CV TIER_1 English(EN) · Yitong Wang · 2026-05-15 07:08

DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer

Large-scale pre-trained diffusion models have been extensively adopted for real-world image Super-Resolution because of their powerful generative priors through textual guidance. However, when super-resolving high-resolution images with patch-wise inference strategy, most existin…

arXiv cs.CV TIER_1 English(EN) · Zitong Yu · 2026-05-15 07:06

DealMaTe: Multi-Dimensional Material Transfer via Diffusion Transformer

Recently, diffusion-based material transfer methods rely on image fine-tuning or complex architectures with auxiliary networks but face challenges such as text dependency, additional computational costs, and feature misalignment. To address these limitations, we propose D…

arXiv cs.CV TIER_1 English(EN) · Xiu Li · 2026-05-15 06:31

MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer

Recent diffusion-based methods for material transfer rely on image fine-tuning or complex architectures with assistive networks, but face challenges including text dependency, extra computational costs, and feature misalignment. To address these limitations, we propose MaTe, a st…