Planning with Unified Multimodal Models
Researchers have introduced Uni-Plan, a planning framework built on unified multimodal models (UMMs). Unlike earlier methods that reason only in language, Uni-Plan uses a UMM that both accepts and produces multimodal content, so it can reason through generated visual content. A single model serves as the policy, the dynamics model, and the value function, and a self-discriminated filtering technique guards against hallucinated dynamics predictions. In experiments on embodied decision-making tasks, Uni-Plan achieves significantly higher success rates than vision-language model (VLM) based approaches, scales well with additional data, and outperforms existing methods trained on similar amounts of data.
AI IMPACT This framework could make AI decision-making more robust by integrating visual reasoning, potentially improving performance in complex embodied tasks.
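The summary describes one model acting as policy, dynamics model, and value function, with a filter that discards the model's own implausible predictions. The toy sketch below illustrates that control flow only; it is not the Uni-Plan implementation. Everything in it is an assumption made for illustration: the class `ToyUnifiedModel`, its methods, the integer "states" standing in for images, the hand-written consistency rule, and the use of beam search as the planner.

```python
from typing import List, Tuple


class ToyUnifiedModel:
    """Hypothetical stand-in for a UMM that plays three planning roles."""

    def __init__(self, goal: int) -> None:
        self.goal = goal

    def propose_actions(self, state: int) -> List[int]:
        # Policy role: candidate actions for the current state.
        return [-1, 1, 2]

    def predict_next(self, state: int, action: int) -> int:
        # Dynamics role. Action 2 is deliberately "hallucinated": the model
        # claims it jumps straight to the goal, whatever the distance.
        if action == 2:
            return self.goal
        return state + action

    def is_consistent(self, state: int, action: int, next_state: int) -> bool:
        # Self-discrimination role (toy rule, an assumption): the same model
        # checks that its predicted move matches the size of the action.
        return abs(next_state - state) == abs(action)

    def value(self, state: int) -> int:
        # Value role: states closer to the goal score higher.
        return -abs(self.goal - state)


def plan(model: ToyUnifiedModel, start: int, horizon: int = 6,
         beam_width: int = 2, use_filter: bool = True) -> List[int]:
    """Beam search over imagined rollouts, optionally filtering predictions."""
    beams: List[Tuple[int, List[int]]] = [(start, [])]
    for _ in range(horizon):
        candidates: List[Tuple[int, List[int]]] = []
        for state, actions in beams:
            for action in model.propose_actions(state):
                nxt = model.predict_next(state, action)
                if use_filter and not model.is_consistent(state, action, nxt):
                    continue  # drop transitions the model itself rejects
                candidates.append((nxt, actions + [action]))
        if not candidates:
            break  # every prediction was filtered; keep the previous beams
        candidates.sort(key=lambda c: model.value(c[0]), reverse=True)
        beams = candidates[:beam_width]
        if beams[0][0] == model.goal:
            break
    return beams[0][1]


model = ToyUnifiedModel(goal=5)
filtered_plan = plan(model, start=0)                      # [1, 1, 1, 2]
unfiltered_plan = plan(model, start=0, use_filter=False)  # [2]
```

In this toy setting the true effect of an action is simply `state + action`. The filtered plan sums to 5 and reaches the goal, while the unfiltered plan trusts the hallucinated jump and would actually end at state 2.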