When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?
A new paper explores the safety implications of the "think-with-image" reasoning paradigm in large vision-language models. The researchers found that systems using explicit image-tool interaction were significantly more robust against multimodal jailbreaks, with attack success rates reduced by approximately 30% on average. The robustness persisted even when the image-tool output was manipulated, suggesting that the benefit stems from the invocation process itself rather than from the content of the output. To explain this, the study proposes an "image-tool safety vector" framework that models the invocation as a shift towards safety-relevant representations.
IMPACT: Explicit image-tool interaction emerges as a promising method to enhance the safety of multimodal AI systems against jailbreaking attempts.
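
The summary does not say how the "image-tool safety vector" is computed, so the following is only an illustrative sketch of what "a shift towards safety-relevant representations" could look like, not the paper's method. It assumes a common difference-of-means construction: average the hidden states collected with and without an image-tool invocation, take the difference, and measure how far that difference moves along a safety-relevant direction. The data is synthetic, and every name here (`mean_shift_vector`, `projection_onto`, `safety_direction`) is hypothetical.

```python
import numpy as np


def mean_shift_vector(with_tool, without_tool):
    """Difference of mean activations between the two conditions.

    Assumption: this difference stands in for an 'image-tool safety vector'.
    """
    return with_tool.mean(axis=0) - without_tool.mean(axis=0)


def projection_onto(direction, states):
    """Scalar projection of each state (row) onto a direction."""
    unit = direction / np.linalg.norm(direction)
    return states @ unit


# Synthetic stand-ins for hidden states; no real model is involved.
rng = np.random.default_rng(0)
dim = 16

# Hypothetical safety-relevant direction: the first coordinate axis.
safety_direction = np.zeros(dim)
safety_direction[0] = 1.0

# Simulate tool invocation as an offset along the safety direction.
without_tool = rng.normal(size=(200, dim))
with_tool = rng.normal(size=(200, dim)) + 2.0 * safety_direction

shift = mean_shift_vector(with_tool, without_tool)

# Average movement along the safety direction caused by the invocation.
gain = (projection_onto(safety_direction, with_tool).mean()
        - projection_onto(safety_direction, without_tool).mean())
```

In this toy setup the shift depends only on whether the tool was invoked, not on what the tool returned, which mirrors the reported finding that robustness held even when the tool output was manipulated.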