Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models
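Lens trains efficiently; RankE framework improves discrete text-to-image generation
By the PulseAugur editorial team · [11 sources]
Researchers have introduced Lens, a 3.8-billion-parameter text-to-image model that uses a densely captioned dataset and an efficient architecture to match the performance of larger models with significantly less training compute. It generates high-resolution images quickly and supports multilingual prompts. Separately, a new framework called RankE has been developed for discrete text-to-image models. It jointly optimizes the generator and the decoder to improve both alignment and image fidelity, addressing the problem of latent covariate shift.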
AI
arXiv:2510.22827v3 Announce Type: replace-cross Abstract: Evaluating text-to-image (T2I) systems requires judging not only whether an image matches a prompt, but also whether socially salient attributes are represented faithfully and without unsupported inference. Existing automa…
Lens is a compact 3.8B-parameter text-to-image model that is competitive with, and in several cases surpasses, models with more than 6B parameters while using less training compute, through dense caption datasets, multi-resolution batching, an efficient architecture, and optimization techniques.
Discrete autoregressive text-to-image models suffer from latent covariate shift during policy optimization, which RankE addresses through end-to-end co-evolution of policy and decoder components.
arXiv cs.CV
Shizhan Liu, Hao Zheng, Hang Yu, Jianguo Li
arXiv:2503.01122v2 Announce Type: replace Abstract: Image personalization has garnered attention for its ability to customize Text-to-Image generation using only a few reference images. However, a key challenge in image personalization is the issue of conceptual coupling, where t…
arXiv:2605.25876v1 Announce Type: new Abstract: With the continued advancement of text-to-image (T2I) generation, producing high-quality images is becoming increasingly attainable; consequently, user demands are shifting toward images that better satisfy their specific requirements. As reward models play an increasingly import…
arXiv:2605.25763v1 Announce Type: new Abstract: Text-to-image synthesis has made significant progress, benefiting from the strong generative capabilities of diffusion models. However, these models struggle to achieve precise text-to-image alignment within cross-attention maps during the denoising process. Existing works primar…
arXiv:2605.21573v1 Announce Type: new Abstract: We introduce Lens, a 3.8B-parameter T2I model that achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6B parameters across various benchmarks, while requiring significantly…
Discrete autoregressive (AR) text-to-image (T2I) models pair a VQ tokenizer with an AR policy, and current post-training pipelines optimize only the policy while keeping the VQ decoder frozen. Recent diffusion T2I work, exemplified by REPA-E, has shown that the VAE itself constit…