Researchers have introduced TurtleAI, a new benchmark that evaluates vision-language models (VLMs) on educational visual programming tasks using Turtle Graphics. Across the benchmark's 823 tasks, more than 20 leading VLMs, including GPT-5 and GPT-4o, struggled significantly, with success rates often below 30%. The results point to difficulties with spatial reasoning and precise visual replication. A proposed data generation technique, used to fine-tune Qwen2-VL-72B, improved performance on real-world tasks by approximately 20%.
AI IMPACT Highlights limitations in current VLMs for educational visual programming, suggesting areas for future model development.
AI-generated summary · Google Gemini · from 2 sources.