Brief · PulseAugur

RESEARCH · arXiv cs.AI English(EN) · 1d · [2 sources]

TurtleAI: Benchmarking Multimodal Models for Visual Programming in Turtle Graphics

Researchers have introduced TurtleAI, a benchmark for evaluating vision-language models (VLMs) on educational visual programming tasks in Turtle Graphics. Across its 823 tasks, more than 20 leading VLMs, including GPT-5 and GPT-4o, struggled, with success rates often below 30%. The authors also propose a data generation technique; combined with fine-tuning Qwen2-VL-72B, it yielded an improvement of approximately 20% on real-world tasks. The results point to persistent difficulties with spatial reasoning and precise visual replication.
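For readers unfamiliar with Turtle Graphics, the sketch below illustrates what "precise visual replication" can mean in practice. It is an illustration only: the brief does not describe how TurtleAI scores a task, so the exact vertex-match check here is an assumption, and `trace` and `success_rate` are hypothetical helpers rather than part of the benchmark. A tiny headless interpreter stands in for Python's `turtle` module so the example runs without a display.

```python
import math

def trace(commands):
    """Run (name, value) turtle commands and return the vertices visited."""
    x, y, heading = 0.0, 0.0, 0.0  # start at the origin, facing east
    points = [(x, y)]
    for name, value in commands:
        if name == "forward":
            x += value * math.cos(math.radians(heading))
            y += value * math.sin(math.radians(heading))
            # Round to hide floating-point noise from the trigonometry.
            points.append((round(x, 6), round(y, 6)))
        elif name == "left":
            heading = (heading + value) % 360
        elif name == "right":
            heading = (heading - value) % 360
        else:
            raise ValueError(f"unknown command: {name}")
    return points

def success_rate(results):
    """Fraction of attempts that passed (assumed scoring, not the paper's)."""
    return sum(results) / len(results)

# Target drawing: a square with side 100.
target = [("forward", 100), ("left", 90)] * 4

# Two hypothetical model outputs: one exact, one with a 10-degree turn error.
attempt_ok = [("forward", 100), ("left", 90)] * 4
attempt_bad = [("forward", 100), ("left", 80)] * 4

results = [trace(a) == trace(target) for a in (attempt_ok, attempt_bad)]
print(results, success_rate(results))
```

A turn that is off by only 10 degrees already produces a different figure, which is one way to see why small spatial errors can translate into outright task failures.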

IMPACT: Highlights limitations of current VLMs on educational visual programming and suggests areas for future model development.

GPT-4o
GPT-5
TurtleAI
Qwen2-VL-72B