Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics
Researchers have identified a significant blindspot in automatic evaluation metrics for text-to-image models, termed "prototypicality bias." This bias causes metrics to favor images that are visually plausible or socially prototypical, even if they do not accurately reflect the prompt's semantic meaning. To address this, a new benchmark called PROTOBIAS has been developed, which contrasts semantically correct images with prototypical but semantically incorrect adversaries. Initial findings indicate that many current evaluation metrics fail on this benchmark, while human judgment remains more reliable for assessing semantic accuracy.
AI IMPACT Highlights limitations in current AI image generation evaluation, potentially guiding development of more semantically faithful assessment tools.
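
The contrastive setup described in the summary can be illustrated with a minimal sketch. It assumes each benchmark item is a triple of prompt, semantically correct image, and prototypical adversary, and that a metric "passes" an item when it scores the correct image strictly higher. The pair format, the function names, and the toy metrics are hypothetical illustrations, not the PROTOBIAS protocol or any real metric's API.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical item format: (prompt, correct_image, prototypical_adversary).
# Images are plain strings here; a real setup would pass image data.
Pair = Tuple[str, str, str]
Metric = Callable[[str, str], float]


def pairwise_accuracy(metric: Metric, pairs: Sequence[Pair]) -> float:
    """Fraction of items where the metric prefers the correct image."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    wins = sum(
        1
        for prompt, correct, adversary in pairs
        if metric(prompt, correct) > metric(prompt, adversary)
    )
    return wins / len(pairs)


# Toy stand-ins (assumptions for illustration only).
def prototypical_metric(prompt: str, image: str) -> float:
    # Ignores the prompt and rewards "typical-looking" images.
    return 1.0 if image.startswith("typical") else 0.0


def faithful_metric(prompt: str, image: str) -> float:
    # Rewards images labeled as matching the prompt.
    return 1.0 if image.startswith("correct") else 0.0


toy_pairs = [
    ("a purple banana", "correct_purple_banana.png", "typical_yellow_banana.png"),
    ("a square wheel", "correct_square_wheel.png", "typical_round_wheel.png"),
]

biased_score = pairwise_accuracy(prototypical_metric, toy_pairs)
faithful_score = pairwise_accuracy(faithful_metric, toy_pairs)
```

Under these assumptions, a metric with prototypicality bias scores low on pairwise accuracy because it keeps preferring the adversary, while a semantically faithful scorer scores high.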