Researchers have developed the Visual Aesthetic Benchmark (VAB) to evaluate how well multimodal large language models (MLLMs) can judge beauty in images. The study found that current frontier MLLMs perform significantly worse than human experts at comparative aesthetic evaluation: even the strongest tested system correctly identified the best and worst images in only 26.5% of tasks, versus 68.9% for human experts, highlighting a gap in AI's aesthetic judgment capabilities.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Highlights a significant gap in AI's ability to perform nuanced aesthetic judgments, potentially impacting creative AI applications.
RANK_REASON The cluster describes a new academic benchmark and evaluation of existing models against it.