Brief · PulseAugur

TOOL · arXiv cs.LG · English (EN) · 2w

Is GPT-4o mini Blinded by its Own Safety Filters? Exposing the Multimodal-to-Unimodal Bottleneck in Hate Speech Detection

A research paper identifies a flaw in OpenAI's GPT-4o mini, which it terms the "Unimodal Bottleneck." The model's safety filters override its multimodal reasoning, producing incorrect classifications, particularly in hate speech detection. The study found that these safety overrides are triggered equally by visual and textual content, and that they also flag benign content, which illustrates a tension between capability and safety in AI systems.

IMPACT: Highlights potential safety vulnerabilities in deployed multimodal models, suggesting a need for more integrated alignment strategies.

OpenAI
GPT-4o mini
Hateful Memes Challenge dataset
Niruthiha Selvanayagam