New AI Research Focuses on Multimodal Reasoning, Efficiency, and Robot Perception

By the PulseAugur editorial team · [12 sources] · 2026-06-23 07:22

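Several research papers posted on arXiv propose new methods for improving multimodal reasoning in AI models. VISE (Visual Invariance Self-Evolution) addresses visual under-conditioning by enforcing spatial and semantic invariance, and reports significant gains on image captioning and VQA tasks. Visual-OPSD targets efficient inference: it distills knowledge from a teacher model that uses privileged visual thoughts into a text-only student model, yielding a substantial speedup. Another method, Ask, Solve, Generate, uses self-consistency rewards to improve visual understanding and image generation autonomously, without external supervision. Position Rebinding Cache Reuse (PRCR) addresses stale position bindings in the visual cache, enabling replay-free visual revisiting and reducing computation. Finally, OctoSense proposes a self-supervised learning framework for multimodal robot perception that uses multiple sensors and outperforms image-only models across a range of tasks.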

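Impact: These papers introduce new techniques for improving multimodal reasoning, efficiency, and robot perception, which could strengthen AI systems on complex tasks.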

Ranking rationale: Multiple research papers published on arXiv detail new methods for multimodal AI.

Read on arXiv cs.CV →

AI-generated summary · Google Gemini · from 12 sources. How we write summaries →

Sources [12]

arXiv cs.CV TIER_1 English(EN) · Shravan Venkatraman, Ritesh Thawkar, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Salman Khan, Fahad Khan · 2026-06-26 04:00

Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models

arXiv:2606.27373v1 Announce Type: new Abstract: Recently, self-evolving large multimodal models (LMMs) have received attention for improving visual reasoning in a purely unsupervised setting. However, multi-role self-play and self-consistency reward schemes in existing self-evolv…
arXiv cs.CV TIER_1 English(EN) · Pengyu Li, Zhitao Gao, Lingling Zhang, Muye Huang, Yuanming Li, Fangzhi Xu, Jun Liu · 2026-06-26 04:00

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

arXiv:2606.18974v2 Announce Type: replace Abstract: Unified multimodal models (UMMs) interleave generated "visual thoughts" (VTs) with text reasoning to improve spatial tasks. This incurs roughly an order-of-magnitude inference cost from multi-step diffusion. We find this cost …
arXiv cs.CV TIER_1 English(EN) · Ritesh Thawkar, Shravan Venkatraman, Omkar Thawakar, Abdelrahman Shaker, Fahad Khan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer · 2026-06-26 04:00

Ask, Solve, Generate: Self-Evolving Unified Multimodal Understanding and Generation via Self-Consistency Rewards

arXiv:2606.27376v1 Announce Type: new Abstract: Most unified large multimodal models (LMMs) that support both visual understanding and image generation still rely on curated post-training supervision, such as human annotations, preference labels, or external reward models. We ask…
arXiv cs.CV TIER_1 English(EN) · Mengzhao Wang, Yanli Ji, Wangmeng Zuo, Peng Ye, Chongjun Tu · 2026-06-26 04:00

Position Rebinding Cache Reuse: Replay-Free Visual Revisiting for Interleaved Multimodal Reasoning

arXiv:2606.26631v1 Announce Type: new Abstract: Interleaved multimodal reasoning improves visual grounding by revisiting visual evidence during multi-step generation, yet existing methods typically rely on token replay, repeatedly forwarding selected visual tokens. A natural shor…
arXiv cs.CV TIER_1 English(EN) · Anthony Bisulco, Jeremy Wang, Kostas Daniilidis, Randall Balestriero, Pratik Chaudhari · 2026-06-26 04:00

OctoSense: Self-Supervised Learning for Multimodal Robot Perception

arXiv:2606.27317v1 Announce Type: new Abstract: We present OctoSense, an open-source sensor platform with stereo RGB and event cameras, LiDAR, a thermal camera, an inertial measurement unit, RTK-corrected global positioning system, and proprioception (CAN bus data from a car, and…
arXiv cs.CV TIER_1 English(EN) · Rao Muhammad Anwer · 2026-06-25 17:59

Ask, Solve, Generate: Self-Evolving Unified Multimodal Understanding and Generation via Self-Consistency Rewards

Most unified large multimodal models (LMMs) that support both visual understanding and image generation still rely on curated post-training supervision, such as human annotations, preference labels, or external reward models. We ask whether a unified LMM can improve both abilitie…
arXiv cs.CV TIER_1 English(EN) · Fahad Khan · 2026-06-25 17:59

Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models

Recently, self-evolving large multimodal models (LMMs) have received attention for improving visual reasoning in a purely unsupervised setting. However, multi-role self-play and self-consistency reward schemes in existing self-evolving LMMs optimize answer agreement without ensur…
arXiv cs.CV TIER_1 English(EN) · Pratik Chaudhari · 2026-06-25 17:30

OctoSense: Self-Supervised Learning for Multimodal Robot Perception

We present OctoSense, an open-source sensor platform with stereo RGB and event cameras, LiDAR, a thermal camera, an inertial measurement unit, RTK-corrected global positioning system, and proprioception (CAN bus data from a car, and joint angles for a quadruped robot). The eponym…
arXiv cs.CV TIER_1 English(EN) · Chongjun Tu · 2026-06-25 05:47

Position Rebinding Cache Reuse: Replay-Free Visual Revisiting for Interleaved Multimodal Reasoning

Interleaved multimodal reasoning improves visual grounding by revisiting visual evidence during multi-step generation, yet existing methods typically rely on token replay, repeatedly forwarding selected visual tokens. A natural shortcut is to reuse the historical visual key-value…
arXiv cs.CV TIER_1 English(EN) · Xiuwei Chen, Wentao Hu, Yongxin Wang, Zisheng Chen, Likui Zhang, Kun Xiang, Jianhua Han, Hui-Ling Zhen, Jingyuan Zou, Hang Xu, Xiaodan Liang · 2026-06-24 04:00

Latent Visual States for Efficient Multimodal Reasoning

arXiv:2606.24233v1 Announce Type: new Abstract: The integration of visual evidence has significantly enhanced the capabilities of large multimodal models. However, this integration predominantly relies on generating discrete outputs (e.g., code or box coordinates) to invoke exter…
arXiv cs.CV TIER_1 English(EN) · Xiaodan Liang · 2026-06-23 07:22

Latent Visual States for Efficient Multimodal Reasoning

The integration of visual evidence has significantly enhanced the capabilities of large multimodal models. However, this integration predominantly relies on generating discrete outputs (e.g., code or box coordinates) to invoke external tools, a process that introduces rigid depen…
dev.to — LLM tag TIER_1 English(EN) · Devanshu Biswas · 2026-06-25 18:59

Multimodal AI: One Model That Sees, Reads, and Hears

The models you use now don't just read: they see, hear, and read images too. GPT-4o, Gemini, Claude with vision: all multimodal. The trick that makes it work is the same embeddings idea, stretched across senses. Here's how, visualized. 👁️‍🗨️ Watch modalities me…
