Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought
Researchers have introduced Shape-of-Thought (SoT), a visual Chain-of-Thought framework designed to improve compositional structure in text-to-image generation. SoT trains a multimodal autoregressive model to produce interleaved textual plans and intermediate visual states, so that an object is assembled progressively rather than generated in a single pass. This helps with challenges such as attribute binding and part-level relations without requiring explicit geometric representations. To support SoT, the researchers developed a new dataset, SoT-26K, and a benchmark, T2S-CompBench. Fine-tuning on SoT-26K is reported to yield significant improvements in component numeracy and structural topology compared with direct generation.
AI IMPACT: Enhances compositional control in text-to-image models, potentially leading to more accurate and better-structured visual outputs.