Researchers have introduced SleepWalk, a new benchmark designed to rigorously test the instruction-guided vision-language navigation capabilities of AI models. The benchmark focuses on localized, interaction-centric embodied reasoning within 3D environments: a model must predict a trajectory that follows a natural language instruction while respecting scene geometry and avoiding collisions. SleepWalk groups tasks into three difficulty tiers so that performance can be analyzed as spatial and temporal complexity increases. The evaluation reveals significant failures in grounded spatial reasoning, particularly on multi-step instructions and under occlusion.
Impact: This benchmark is expected to help advance grounded multimodal reasoning and the development of action-capable agents in 3D environments.
Ranking rationale: The cluster describes a new academic benchmark paper for evaluating AI models.
AI-generated summary · Google Gemini · based on 1 source.