ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models
Researchers have introduced ERGeoBench, a benchmark that evaluates the geo-localization capabilities of multimodal large language models (MLLMs) acting as embodied agents. The benchmark assesses models in single-view, panorama-view, and embodied-view settings, using more than 2,200 street-view panoramas. Evaluations indicate that current MLLMs grasp high-level geographic concepts but still struggle with precise metric localization and with keeping their spatial judgments consistent across views, which points to the need for integrated perception and reasoning.
IMPACT: Provides a standardized evaluation for embodied AI agents, pushing development in spatial reasoning and geo-localization.
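
To make "precise metric localization" concrete: geo-localization predictions are commonly scored by the great-circle distance between the predicted and true coordinates, then summarized as the share of predictions that fall within fixed distance thresholds. The sketch below illustrates that general approach only. The summary above does not state which metrics ERGeoBench uses, so the function names and the 1/25/200/750/2500 km thresholds (a convention from earlier image geo-localization work) are assumptions, not the benchmark's definitions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def accuracy_within(errors_km, thresholds_km=(1, 25, 200, 750, 2500)):
    """Fraction of predictions whose error is at or below each threshold.

    The default thresholds are an assumed convention, not ERGeoBench's.
    """
    n = len(errors_km)
    if n == 0:
        raise ValueError("errors_km must not be empty")
    return {t: sum(e <= t for e in errors_km) / n for t in thresholds_km}


if __name__ == "__main__":
    # Hypothetical (predicted, true) coordinate pairs for illustration only.
    pairs = [((48.8584, 2.2945), (48.8606, 2.3376)),    # same city
             ((40.7128, -74.0060), (34.0522, -118.2437))]  # different coasts
    errors = [haversine_km(*pred, *true) for pred, true in pairs]
    print(accuracy_within(errors))
```

Under a scheme like this, a model that recognizes the right country but not the right street scores well at the coarse thresholds and poorly at the fine ones, which matches the gap between high-level geographic understanding and metric precision described above.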