PulseAugur

New dataset reveals MLLMs vulnerable to diverse video inputs

Researchers have introduced MCV SafetyBench, a dataset of 2,920 videos for testing how vulnerable multimodal large language models (MLLMs) are to malicious inputs. Results on the dataset indicate that MLLMs are more susceptible to harmful content presented as diverse, dynamic video than as static images. The success rate of jailbreak attacks also rises with the number of video clips used, which leads the authors to propose a defense strategy that leverages the robustness of the image modality.

IMPACT Highlights potential security risks in video-processing AI and suggests new defense strategies.

RANK_REASON The cluster contains two academic papers discussing research into multimodal large language models.


AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

  1. arXiv cs.AI TIER_1 English(EN) · Choongwon Kang, Seungjong Sun, Hyunmin Jun, Jang Hyun Kim

    Jailbreaking Multimodal Large Language Models using Multi-Clip Video

    arXiv:2606.02111v1 Announce Type: cross Abstract: As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. Prior jailbreak studies have shown that safety alignment in MLLMs can be bypassed…

  2. arXiv cs.AI TIER_1 English(EN) · Jang Hyun Kim

    Jailbreaking Multimodal Large Language Models using Multi-Clip Video

    As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. Prior jailbreak studies have shown that safety alignment in MLLMs can be bypassed through visual inputs, yet it remains unclear whi…

  3. arXiv cs.CV TIER_1 English(EN) · Bingzheng Qu, Kehai Chen, Xuefeng Bai, Min Zhang

    Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey

    arXiv:2604.11283v2 Announce Type: replace Abstract: Recent progress in multimodal large language models (MLLMs) is reshaping video translation from a cascaded pipeline of automatic speech recognition, machine translation, text-to-speech, and lip synchronization into a unified mul…