PulseAugur

New OmniCoT benchmark targets MLLMs' panoramic reasoning skills

Researchers have introduced OmniCoT, a new benchmark suite designed to evaluate and improve the panoramic spatial reasoning capabilities of Multimodal Large Language Models (MLLMs). Existing benchmarks often overlook the full 360° potential of panoramic imagery, focusing instead on simpler, local cues. OmniCoT aims to enable MLLMs to perform complex, multi-step reasoning across viewpoints by providing structured Chain-of-Thought annotations for training and evaluation datasets. The suite includes OmniCoT-B for evaluation, OmniCoT-Real for assessing the sim-to-real gap, and OmniCoT-T for training, along with a two-stage training strategy that anchors reasoning to panoramic evidence and penalizes geometric incoherence.

IMPACT This benchmark could drive advancements in MLLMs' ability to understand and reason about complex 3D environments, a capability crucial for embodied AI applications.

RANK_REASON The cluster describes a new benchmark and associated training methodology for evaluating MLLMs, published on arXiv.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.CV TIER_1 English(EN) · Haocong He, Chenfei Liao, Zichen Wen, Zihao Dongfang, Xu Zheng, Bin Ren, Chang Su, Zixin Zhang, Harold Haodong Chen, Hongfei Zhang, Weijia Li, Kailun Yang, Conghui He, Xuming Hu, Nicu Sebe, Linfeng Zhang ·

    OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

    arXiv:2606.30378v1. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated promising spatial reasoning capabilities, while these abilities remain underexplored in the emerging visual modality of panoramic imagery. The full 360°×180°…

  2. arXiv cs.CV TIER_1 English(EN) · Linfeng Zhang ·

    OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

    Multimodal Large Language Models (MLLMs) have demonstrated promising spatial reasoning capabilities, while these abilities remain underexplored in the emerging visual modality of panoramic imagery. The full 360°×180° field of view of panoramas essentially supports complex …