m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning
Researchers have introduced m2sv, a new benchmark designed to test the spatial reasoning capabilities of vision-language models (VLMs). The benchmark challenges models to align overhead map views with egocentric street-level imagery, a task where current VLMs struggle. Despite advancements in multimodal AI, the top-performing VLM achieved only 65.2% accuracy on m2sv, significantly lower than that of human annotators.
AI IMPACT: Highlights persistent gaps in geometric alignment and reasoning for vision-language models, motivating future research.