PulseAugur

ViTok-v2 scales to 5B parameters, advancing image autoencoder reconstruction and generation

Researchers have introduced ViTok-v2, a 5-billion-parameter image autoencoder that scales to larger resolutions and parameter counts than previous models. The model adds native-resolution support and a DINOv3 perceptual loss to improve reconstruction quality across a range of image sizes. ViTok-v2 was trained on approximately 2 billion images and outperforms existing methods at higher resolutions.
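The DINOv3 perceptual loss mentioned above amounts to comparing the original and reconstructed images in the feature space of a frozen pretrained encoder rather than pixel space. A minimal sketch of that idea, with a toy random projection standing in for the real frozen DINOv3 features (the `toy_features` function and its dimensions are illustrative assumptions, not the actual pipeline):

```python
import numpy as np

def perceptual_loss(x, x_hat, feature_fn):
    """Mean squared error between feature embeddings of two images.

    feature_fn plays the role of a frozen perceptual encoder
    (DINOv3 in the paper's setup; a toy stand-in here)."""
    f_x = feature_fn(x)
    f_hat = feature_fn(x_hat)
    return float(np.mean((f_x - f_hat) ** 2))

# Stand-in for a frozen encoder: a fixed random linear projection of a
# flattened 4x4 "image" to an 8-dim feature vector (illustrative only).
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))

def toy_features(img):
    return img.reshape(-1) @ W

x = rng.standard_normal((4, 4))
loss_same = perceptual_loss(x, x, toy_features)        # identical images -> 0
loss_diff = perceptual_loss(x, x + 0.5, toy_features)  # perturbed copy -> > 0
```

In training, this feature-space term is typically added to a pixel reconstruction loss, so the autoencoder is penalized for losing semantic structure even when per-pixel error is small.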

Summary written by gemini-2.5-flash-lite from 1 source. How we write summaries →

IMPACT Advances the state-of-the-art in image autoencoders, potentially improving generative model capabilities.

RANK_REASON This is a research paper detailing a new model architecture and its performance.

Read on arXiv cs.LG →

COVERAGE [1]

  1. arXiv cs.LG TIER_1 · Philippe Hansen-Estruch, Jiahui Chen, Vivek Ramanujan, Orr Zohar, Yan Ping, Animesh Sinha, Markos Georgopoulos, Edgar Schoenfeld, Ji Hou, Felix Juefei-Xu, Sriram Vishwanath, Ali Thabet

    ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters

    arXiv:2605.05331v1 Announce Type: cross Abstract: Vision Transformer (ViT) autoencoders have emerged as compelling tokenizers for images, offering improved reconstruction over convolutional tokenizers. However, existing ViT tokenizers cannot explore this landscape as performance …