Researchers have introduced SleepWalk, a new benchmark for rigorously testing the instruction-guided vision-language navigation capabilities of AI models. The benchmark focuses on localized, interaction-centric embodied reasoning in 3D environments: a model must predict a trajectory that follows natural-language instructions while respecting scene geometry and avoiding collisions. SleepWalk divides its tasks into three difficulty tiers, enabling detailed analysis of how models cope with increasing spatial and temporal complexity. Results reveal significant failures in grounded spatial reasoning, particularly on multi-step instructions and under occlusion.
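The summary does not describe SleepWalk's exact scoring protocol, but the evaluation criteria it names (instruction-conditioned trajectory prediction, scene-geometry compliance, collision avoidance) can be sketched generically. Everything below, including the `Scene`, `collides`, and `evaluate` names and the circle-obstacle geometry, is an illustrative assumption, not the benchmark's actual implementation:

```python
from dataclasses import dataclass

# Hypothetical sketch of trajectory evaluation: the agent must reach the
# instructed goal without overlapping any obstacle along its path.
# All names and geometry choices here are assumptions for illustration.

@dataclass
class Scene:
    obstacles: list[tuple[float, float, float]]  # (x, y, radius) circles

def collides(point, scene, agent_radius=0.2):
    """True if an agent centered at `point` overlaps any obstacle."""
    x, y = point
    return any((x - ox) ** 2 + (y - oy) ** 2 < (r + agent_radius) ** 2
               for ox, oy, r in scene.obstacles)

def evaluate(trajectory, goal, scene, goal_tolerance=0.5):
    """Success = every waypoint is collision-free and the path ends at the goal."""
    if any(collides(p, scene) for p in trajectory):
        return {"success": False, "reason": "collision"}
    gx, gy = goal
    fx, fy = trajectory[-1]
    reached = (fx - gx) ** 2 + (fy - gy) ** 2 <= goal_tolerance ** 2
    return {"success": reached, "reason": None if reached else "goal_not_reached"}

scene = Scene(obstacles=[(1.0, 1.0, 0.3)])
path = [(0.0, 0.0), (0.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
print(evaluate(path, goal=(2.0, 2.0), scene=scene))
# → {'success': True, 'reason': None}
```

A real benchmark of this kind would additionally score how well the trajectory matches the instruction's intermediate steps, which is where the summary says models struggle most.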
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT This benchmark will help advance grounded multimodal reasoning and the development of action-capable agents in 3D environments.
RANK_REASON The cluster describes a new academic benchmark paper for evaluating AI models.