Linearizing Vision Transformer with Test-Time Training
Researchers have developed a method to convert pretrained Vision Transformer models into linear-complexity Test-Time Training (TTT) architectures. The approach aligns the architectural and representational properties of the TTT model with those of the original Softmax attention model, so that pretrained weights can be transferred efficiently. Applying it to Stable Diffusion 3.5 yields SD3.5-T^5, which achieves comparable image quality with significantly faster inference after minimal fine-tuning.
AI IMPACT: Enables faster inference for large vision models by adapting existing pretrained architectures.
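
To make the "linear-complexity" claim concrete, the sketch below shows a generic TTT-style layer in plain Python. It is an illustration under stated assumptions, not the method from the work summarized above: the function name `ttt_linear`, the per-token causal update order, the squared-error inner loss, and the learning rate `lr` are all hypothetical choices made for this example. The layer keeps a small "fast weight" matrix that is updated with one gradient step per token, so the cost grows linearly with the number of tokens instead of quadratically as in Softmax attention.

```python
def ttt_linear(queries, keys, values, lr=1.0):
    """Generic TTT-style layer (illustrative sketch, not the paper's code).

    A fast-weight matrix W is trained at inference time to map each key
    to its value, then applied to the query. Each token costs
    O(d_in * d_out), so the total cost is linear in sequence length.
    """
    d_in = len(keys[0])
    d_out = len(values[0])
    # Fast weights start at zero (an assumption for this sketch).
    W = [[0.0] * d_in for _ in range(d_out)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Inner loss: 0.5 * ||W k - v||^2, with gradient (W k - v) k^T.
        pred = [sum(w * x for w, x in zip(row, k)) for row in W]
        err = [p - t for p, t in zip(pred, v)]
        # One gradient-descent step on the fast weights.
        for i in range(d_out):
            for j in range(d_in):
                W[i][j] -= lr * err[i] * k[j]
        # Read out with the updated weights.
        outputs.append([sum(w * x for w, x in zip(row, q)) for row in W])
    return outputs


# Usage: with a unit-norm key, lr=1.0, and query equal to the key,
# a single update step makes the layer reproduce the value.
result = ttt_linear([[1.0, 0.0]], [[1.0, 0.0]], [[3.0, 4.0]])
```

Because the state carried between tokens is only the matrix `W`, memory use does not depend on how many tokens have been processed, which is the property that makes this family of layers attractive for high-resolution image models.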