Modality Forcing for Scalable Spatial Generation
Researchers have developed a new post-training technique called Modality Forcing, which enables text-to-image models to generate both images and depth maps simultaneously. This method requires only sparse depth data and can be applied to existing Diffusion Transformer models. The technique demonstrates that larger models trained on more image data produce more accurate depth predictions, with the strongest model achieving competitive results against state-of-the-art monocular depth estimators.
AI IMPACT: This technique could lead to more sophisticated AI models capable of understanding and generating 3D spatial information from 2D inputs.