LEVANTE-bench: Multi-Scale Comparison of VLMs to Children Using Cognitive Tasks (or, "Is Your VLM Smarter Than a 5th Grader?")
Researchers have developed LEVANTE-bench, a new benchmark for comparing the cognitive abilities of vision-language models (VLMs) with those of children. The benchmark uses tasks and data from the LEVANTE project, evaluating VLMs against 1,547 children aged 5-12 across three countries. The findings indicate that more capable VLMs align better with children's performance at both the task and item level, but their error patterns do not consistently match children's; smaller models sometimes better reflect younger children's mistakes. Notably, even top-performing VLMs struggled with complex reasoning tasks such as matrix reasoning and mental rotation, suggesting that current VLM architectures only partially mirror human cognitive development.
AI IMPACT: Introduces a novel method for evaluating VLM cognitive alignment with human development, potentially guiding future model improvements.