What-Where Transformer separates object appearance from location

By the PulseAugur editorial team · [1 source] · 2026-05-12 12:08

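Researchers have introduced the What-Where Transformer (WWT), a new vision backbone designed to separate an object's appearance more cleanly from its spatial location. The architecture uses a slot-based design: tokens represent "what" an object is, and attention maps represent "where" it is. Even when trained only with standard classification supervision, WWT shows an emergent ability to discover multiple objects directly from its attention maps. It also improves performance on zero-shot object discovery and weakly supervised semantic segmentation.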
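The summary does not describe WWT's internals beyond the what/where split. The following is therefore only a minimal sketch of that general idea under stated assumptions, not the authors' method. It assumes generic scaled dot-product cross-attention in which K slot query vectors attend over N patch features. Each slot's output vector plays the role of "what", and its attention weights over the patches play the role of "where". The function names `slot_cross_attention` and `discover_objects` are hypothetical. Normalizing the softmax over patches, rather than over slots as classic Slot Attention does, is also an assumption.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def slot_cross_attention(slots, patches):
    """Hypothetical what/where split (illustrative, not WWT's actual design).

    slots:   K query vectors of dimension D (assumed already projected)
    patches: N patch feature vectors of dimension D
    Returns (what, where):
      what[k]  = attention-weighted sum of patch features (slot content, "what")
      where[k] = attention weights over the N patches (slot location map, "where")
    """
    d = len(slots[0])
    scale = 1.0 / math.sqrt(d)
    what, where = [], []
    for q in slots:
        # Assumption: softmax over patches, so each slot's map sums to 1.
        weights = softmax([dot(q, p) * scale for p in patches])
        where.append(weights)
        what.append([sum(w * p[j] for w, p in zip(weights, patches)) for j in range(d)])
    return what, where


def discover_objects(where):
    """Assign each patch to the slot whose attention weight on it is largest."""
    num_slots, num_patches = len(where), len(where[0])
    return [max(range(num_slots), key=lambda k: where[k][i]) for i in range(num_patches)]


# Toy input: four patches forming two "objects" with distinct features.
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
slots = [[5.0, 0.0], [0.0, 5.0]]
what, where = slot_cross_attention(slots, patches)
print(discover_objects(where))  # → [0, 0, 1, 1]
```

Reshaping one row of `where` back to the patch grid gives a per-slot heatmap. Taking the argmax over slots for each patch, as `discover_objects` does, is one simple way to turn such maps into object masks without any localization labels.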

Impact: Introduces a new architectural bias for vision models that could improve localization tasks and emergent object discovery.

Ranking rationale: This cluster contains an academic paper describing a new model architecture.

Read on arXiv cs.CV →

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CV TIER_1 English(EN) · Ikuro Sato · 2026-05-12 12:08

What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization

Many image understanding tasks involve identifying what is present and where it appears. However, tasks that address where, such as object discovery, detection, and segmentation, are often considerably more complex than image classification, which primarily focuses on what. One p…

