OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs
Researchers have introduced OVO-S-Bench, a benchmark for evaluating the spatial intelligence of multimodal large language models (MLLMs) in streaming settings. It comprises 1,680 questions across 348 videos and focuses on continuous egocentric streams relevant to robotics and autonomous driving. Initial evaluations show that Gemini-3.1-Pro lags significantly behind human experts, particularly on allocentric mapping tasks, and that specialized streaming MLLMs, surprisingly, underperform their base models.
AI IMPACT: Establishes a demanding new testbed for streaming spatial intelligence in MLLMs, highlighting current limitations and guiding future development.