Researchers have developed a new method for training multi-modal large language models (MLLMs) to improve their ability to reason over abstract relational knowledge presented in images. The approach uses an automatic data engine that synthesizes images containing multi-modal relational knowledge and generates instruction data with chain-of-thought reasoning. A two-stage capability enhancement framework, trained on a dataset of 64,000 samples, enabled smaller models to outperform GPT-4o on structured and abstract reasoning tasks.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Introduces a novel training framework and dataset that enables smaller models to outperform GPT-4o on specific reasoning tasks.
RANK_REASON This is a research paper introducing a new dataset and training framework for multi-modal reasoning.