Brief · PulseAugur

TOOL · Hugging Face Daily Papers English(EN) · 6d

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Researchers have developed Robust-U1, a framework designed to make multimodal large language models (MLLMs) more robust to visual corruptions. It enables an MLLM to self-recover corrupted visual content, improving both image quality and reasoning capability. Robust-U1 uses a three-stage process: supervised fine-tuning, reinforcement learning with dual rewards, and multimodal reasoning. Experiments show that Robust-U1 achieves state-of-the-art performance on real-world corruption benchmarks and under adversarial corruptions in visual question answering tasks.
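The brief does not say what the two rewards in the reinforcement learning stage are. The tags on this item list Structural Similarity Index Measure and CLIP similarity, so one plausible reading is a pixel-level score paired with a semantic score. The Python sketch below illustrates that reading only: the function names, the equal weighting, the single-window SSIM simplification, and the use of plain vectors in place of real CLIP embeddings are all assumptions, not details from the paper.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM computed over the whole image as one window.

    Standard SSIM averages the same formula over local windows; this
    single-window variant is an illustrative shortcut.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (stand-ins for CLIP features)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def dual_reward(recovered, reference, emb_recovered, emb_reference, weight=0.5):
    """Hypothetical dual reward: weighted mix of a pixel-level and a semantic score."""
    pixel_score = global_ssim(recovered, reference)
    semantic_score = cosine_similarity(emb_recovered, emb_reference)
    return weight * pixel_score + (1.0 - weight) * semantic_score


# Toy usage: a perfect recovery versus a noisy one.
rng = np.random.default_rng(0)
reference = rng.random((8, 8))
noisy = np.clip(reference + 0.3 * rng.standard_normal((8, 8)), 0.0, 1.0)

reward_clean = dual_reward(reference, reference, [1.0, 0.0], [1.0, 0.0])
reward_noisy = dual_reward(noisy, reference, [1.0, 1.0], [1.0, 0.0])
```

In this toy setup a recovery that matches the reference in both pixels and embedding scores close to 1, while the noisy recovery scores lower. In practice the embeddings would come from an actual image encoder, and the weighting would be a tuned hyperparameter.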

IMPACT Enhances MLLM robustness against visual corruptions, potentially improving performance in real-world applications.

MLLMs
Multimodal Large Language Models
visual question answering
Structural Similarity Index Measure
Robust-U1
CLIP similarity