New benchmark reveals AI models lag human experts in judging image beauty

By PulseAugur Editorial · [1 source] · 2026-05-12 19:33

Researchers have developed the Visual Aesthetic Benchmark (VAB) to evaluate how well multimodal large language models (MLLMs) judge beauty in images. Their study found that current frontier MLLMs perform far worse than human experts in comparative aesthetic evaluations. Even the strongest system tested correctly identified the best and worst images in only 26.5% of tasks, compared with 68.9% for human experts, a gap of more than 42 percentage points.

IMPACT Highlights a significant gap in AI's ability to make nuanced aesthetic judgments, which could affect creative AI applications.



AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CV TIER_1 English(EN) · Zhangchen Xu · 2026-05-12 19:33

Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

Multimodal large language models (MLLMs) are now routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require an explicit aesthetic judgment. Most existing solutions reduce this judgment to predicting a scalar score f…

