Researchers have identified significant failures of semantic invariance in popular image-to-text evaluation metrics. These metrics, including CLIPScore, are sensitive to benign spatial edits and phrasing changes, which produce score shifts and ranking flips. In a study, human annotators judged the perturbed image-caption pairs to be equally correct, indicating that the shifts come from the metrics' behavior rather than from any change in meaning. The researchers propose an invariance-calibrated scoring method to mitigate these issues.
AI IMPACT Highlights flaws in current image-text evaluation; addressing them could make AI model assessments more robust and reliable.
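To make "score shifts" and "ranking flips" concrete, here is a minimal sketch of how the two effects could be measured from paired metric scores. It is not the paper's method: the function names and the example scores are made up for illustration, and the invariance-calibrated scoring itself is not reproduced here because the summary gives no details of it.

```python
from itertools import combinations


def mean_abs_shift(original, perturbed):
    """Average absolute change in score across paired items."""
    return sum(abs(o - p) for o, p in zip(original, perturbed)) / len(original)


def count_ranking_flips(original, perturbed):
    """Count item pairs whose relative order reverses after perturbation."""
    flips = 0
    for i, j in combinations(range(len(original)), 2):
        before = original[i] - original[j]
        after = perturbed[i] - perturbed[j]
        # A sign change means the two items swapped places in the ranking.
        if before * after < 0:
            flips += 1
    return flips


# Hypothetical scores for three captioning systems, before and after a
# meaning-preserving edit (values invented for this example).
original = [0.72, 0.70, 0.65]
perturbed = [0.69, 0.71, 0.66]

shift = mean_abs_shift(original, perturbed)
flips = count_ranking_flips(original, perturbed)
```

If humans rate the original and perturbed pairs as equally correct, any nonzero shift or flip count under this kind of check reflects the metric rather than the content.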
RANK_REASON The cluster contains an academic paper detailing a new evaluation methodology for image-text metrics.
AI-generated summary · Google Gemini · from 1 source.