Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

Motion-MLLM enhances 3D scene understanding with egomotion data

By PulseAugur Editorial Team · [1 source] · 2026-05-08 04:00

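Researchers have developed Motion-MLLM, a new framework that combines egomotion data from an inertial measurement unit (IMU) with video to enhance multimodal large language models (MLLMs) for 3D scene understanding. The method uses a cascaded motion-visual keyframe filtering module and an asymmetric cross-modal fusion module to ground visual content in the camera's physical trajectory, which enables reasoning about absolute scale and spatial relationships. Evaluations indicate that Motion-MLLM achieves competitive accuracy while running significantly faster than existing methods.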
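The summary names a cascaded motion-visual keyframe filter but does not describe its internals. As a rough illustration of the general idea only, the sketch below runs a cheap IMU-based pass first and a visual-difference pass second. Everything in it is an assumption made for illustration, not the authors' module: the `Frame` fields, the motion score, the mean-absolute-difference test, and all threshold values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Frame:
    """Hypothetical container: one video frame plus the IMU reading at capture time."""
    index: int
    pixels: Sequence[float]  # flattened grayscale values in [0, 1]
    gyro_norm: float         # angular speed magnitude, rad/s
    accel_norm: float        # linear acceleration magnitude (gravity removed), m/s^2


def motion_stage(frames: List[Frame], dt: float, min_motion: float) -> List[Frame]:
    """Stage 1 (cheap): keep a frame once the motion accumulated since the
    last kept frame reaches min_motion. The first frame is always kept."""
    kept: List[Frame] = []
    accumulated = 0.0
    for i, frame in enumerate(frames):
        # Crude motion score (assumption): IMU magnitudes integrated over time.
        accumulated += (frame.gyro_norm + frame.accel_norm) * dt
        if i == 0 or accumulated >= min_motion:
            kept.append(frame)
            accumulated = 0.0
    return kept


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute per-pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def visual_stage(frames: List[Frame], min_diff: float) -> List[Frame]:
    """Stage 2 (costlier): among motion-selected frames, keep only those that
    look different enough from the previously kept keyframe."""
    kept: List[Frame] = []
    for frame in frames:
        if not kept or mean_abs_diff(frame.pixels, kept[-1].pixels) >= min_diff:
            kept.append(frame)
    return kept


def cascaded_keyframes(frames: List[Frame], dt: float = 0.1,
                       min_motion: float = 0.5,
                       min_diff: float = 0.1) -> List[Frame]:
    """Run the motion filter first so the visual filter sees fewer frames."""
    return visual_stage(motion_stage(frames, dt, min_motion), min_diff)
```

The ordering is the point of the cascade: the IMU test costs a few arithmetic operations per frame, so it can discard frames captured while the camera is nearly static before any pixel comparison is done.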

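Impact: By integrating egomotion data, the work strengthens MLLMs' 3D scene understanding and could improve applications in robotics and autonomous systems.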

Ranking rationale: This is a research paper published on arXiv that details a new framework for improving MLLMs.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CV · Shuyao Shi, Kang G. Shin · 2026-05-08 04:00

Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

arXiv:2603.17980v2 Announce Type: replace Abstract: Recent Multimodal Large Language Models (MLLMs) have shown high potential for spatial reasoning within 3D scenes. However, they typically rely on computationally expensive 3D representations like point clouds or reconstructed Bi…
