Researchers have developed the Visual Aesthetic Benchmark (VAB) to evaluate how well multimodal large language models (MLLMs) can judge beauty in images. The study found that current frontier MLLMs perform significantly worse than human experts at comparative aesthetic evaluation: even the strongest tested system correctly identified the best and worst images in only 26.5% of tasks, versus 68.9% for human experts, highlighting a gap in AI's aesthetic judgment capabilities.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Highlights a significant gap in AI's ability to perform nuanced aesthetic judgments, potentially impacting creative AI applications.
RANK_REASON The cluster describes a new academic benchmark and evaluation of existing models against it.