New AI methods strengthen visual grounding by linking text to image evidence

By PulseAugur Editorial · [4 sources] · 2026-06-15 03:17

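Researchers have developed new visual grounding methods that help AI models tie natural-language descriptions to specific regions of an image. One approach, "Thinking with Visual Grounding," trains models to interleave textual reasoning with explicit visual evidence, improving performance on counting and spatial reasoning tasks to the point of rivaling larger models. Another, LazyMCoT, uses adaptive routing and collaborative grounding to handle difficult image queries efficiently without task-specific training, reaching accuracy comparable to supervised methods while reducing inference time. A third framework, RSVG-ZeroOV, uses frozen foundation models for zero-shot open-vocabulary visual grounding in remote sensing, combining a vision-language model with a diffusion model to progressively refine the localization and handle complex queries without manual annotation.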
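To make the adaptive-routing idea in the summary more concrete, here is a toy Python sketch. It is not LazyMCoT's actual algorithm, which the summary does not specify. It only illustrates the general pattern of answering easy queries directly and spending extra effort on a cropped region when a first pass is unsure. The names `answer_with_routing`, `clamp_box`, `model`, `propose_region`, and the `threshold` value are all hypothetical.

```python
from typing import Callable, Optional, Tuple

# (x0, y0, x1, y1) in pixel coordinates. Hypothetical convention for this sketch.
Box = Tuple[int, int, int, int]
# Assumed interface: model(question, box_or_None) -> (answer_text, confidence in [0, 1]).
Model = Callable[[str, Optional[Box]], Tuple[str, float]]


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a proposed box to the image bounds and order its corners."""
    x0, y0, x1, y1 = box
    x0, x1 = sorted((max(0, min(x0, width)), max(0, min(x1, width))))
    y0, y1 = sorted((max(0, min(y0, height)), max(0, min(y1, height))))
    return (x0, y0, x1, y1)


def answer_with_routing(
    question: str,
    image_size: Tuple[int, int],
    model: Model,
    propose_region: Callable[[str], Box],
    threshold: float = 0.7,  # illustrative value, not taken from any paper
) -> dict:
    """Answer directly when confident; otherwise retry on a focused region."""
    text, conf = model(question, None)  # cheap pass over the full image
    if conf >= threshold:
        return {"answer": text, "confidence": conf, "route": "direct", "box": None}

    # Hard query: ask for a candidate region and look again at that crop.
    width, height = image_size
    box = clamp_box(propose_region(question), width, height)
    focused_text, focused_conf = model(question, box)
    if focused_conf > conf:
        return {"answer": focused_text, "confidence": focused_conf,
                "route": "focused", "box": box}
    # The focused pass did not help, so keep the original answer.
    return {"answer": text, "confidence": conf, "route": "direct", "box": None}
```

In this sketch the extra pass over a cropped region only runs for queries below the confidence threshold, which is one plausible way a router could reduce average inference time.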

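Impact: These advances in visual grounding could lead to more intuitive, verifiable AI interactions, improving applications in robotics, image analysis, and human-computer interfaces.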

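Ranking rationale: This cluster contains several academic papers detailing new research methods and models in AI, specifically in visual grounding.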


AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

arXiv cs.AI TIER_1 English(EN) · Junkai Zhang, Yihe Deng, Kai-Wei Chang, Wei Wang · 2026-06-16 04:00

Thinking with Visual Grounding

arXiv:2606.16122v1 · Abstract: Visual thinking should not only sound right; it should show its evidence. While recent vision-language models (VLMs) can produce natural-language reasoning traces, these traces often leave the supporting image regions implicit, maki…

arXiv cs.CL TIER_1 English(EN) · Yifan Wang, Peiming Li, Shiyu Li, Zhiyuan Hu, Xiaochen Yang, Wenming Yang, Yang Tang, Zheng Wei · 2026-06-16 04:00

Focus When Necessary: Adaptive Routing and Collaborative Grounding for Training-Free Visual Grounding

arXiv:2606.16158v1 · Abstract: While Multimodal Large Language Models (MLLMs) excel in cross-modal reasoning, they often struggle to perceive fine-grained details in complex high-resolution images. Recent training-free methods address this through image scaling…

arXiv cs.CL TIER_1 English(EN) · Zheng Wei · 2026-06-15 03:17

Focus When Necessary: Adaptive Routing and Collaborative Grounding for Training-Free Visual Grounding

While Multimodal Large Language Models (MLLMs) excel in cross-modal reasoning, they often struggle to perceive fine-grained details in complex high-resolution images. Recent training-free methods address this through image scaling and localized cropping. However, applying these m…

arXiv cs.CV TIER_1 English(EN) · Ke Li, Di Wang, Yongshan Zhu, Ting Wang, Weiping Ni, Tao Lei, Quan Wang, Xinbo Gao · 2026-06-16 04:00

Training-Free Open-Vocabulary Visual Grounding for Remote Sensing Images and Videos

arXiv:2606.16124v1 · Abstract: Remote sensing visual grounding (RSVG) aims to localize a referred target in a remote sensing image or video according to a natural language expression. Existing RSVG methods usually rely on task-specific manual annotations, which a…
