LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer
Researchers have introduced LaTtE-Flow, an architecture that unifies image understanding and generation within a single multimodal model. It builds on pretrained Vision-Language Models and adds a Layerwise Timestep-Expert flow-based design. By distributing the flow-matching process across specialized Transformer layers, LaTtE-Flow improves sampling efficiency, achieving approximately six times faster inference than existing unified multimodal models while maintaining competitive image generation quality.
AI IMPACT: This architecture could accelerate the deployment of multimodal AI systems by improving generation speeds.
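To make the layerwise timestep-expert idea concrete, here is a minimal toy sketch in plain Python. It assumes that the timestep range [0, 1) is split into equal intervals and that each interval is served by its own contiguous group of layers, so a sampling step runs only one group rather than the full stack. The summary above does not specify the partitioning scheme, the layer counts, or the sampler, so every name here (`ToyLayer`, `expert_group`, `predict_velocity`, `sample`) is hypothetical, and the "layers" are scalar stand-ins rather than Transformer blocks.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyLayer:
    """Hypothetical stand-in for one Transformer block; counts how often it runs."""
    target: float
    calls: int = 0

    def __call__(self, x: float, t: float, share: float) -> float:
        self.calls += 1
        # Velocity that moves x toward `target` along a straight path,
        # scaled by `share` so the layers of a group sum to the full velocity.
        return share * (self.target - x) / (1.0 - t)


def expert_group(t: float, num_groups: int) -> int:
    """Map a timestep in [0, 1) to the index of the layer group assigned to it."""
    return min(int(t * num_groups), num_groups - 1)


def predict_velocity(layers: List[ToyLayer], x: float, t: float, num_groups: int) -> float:
    """Run only the layer group responsible for timestep t (assumed equal-size groups)."""
    per_group = len(layers) // num_groups
    g = expert_group(t, num_groups)
    active = layers[g * per_group:(g + 1) * per_group]
    return sum(layer(x, t, 1.0 / per_group) for layer in active)


def sample(layers: List[ToyLayer], x0: float, num_steps: int, num_groups: int) -> float:
    """Euler integration of the flow from t=0 toward t=1, one layer group per step."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # stays strictly below 1, so (1 - t) is never zero
        x += dt * predict_velocity(layers, x, t, num_groups)
    return x
```

With 8 toy layers, 4 groups, and 8 Euler steps, each step touches 2 layers instead of 8, which illustrates where a per-step saving could come from. The ratio in this toy is an artifact of the chosen numbers and should not be read as the reported speedup.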