Brief · PulseAugur

TOOL · arXiv cs.AI · English (EN) · 12h

m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

Researchers have introduced m2sv, a benchmark that tests the spatial reasoning of vision-language models (VLMs). It asks models to align overhead map views with egocentric street-level imagery, a task on which current VLMs struggle. Despite advances in multimodal AI, the top-performing VLM reached only 65.2% accuracy on m2sv, well below the accuracy of human annotators.

IMPACT Highlights persistent gaps in the geometric alignment and spatial reasoning abilities of vision-language models, motivating further research.

vision-language models
Yosub Shin