Vision Transformers improved with selective token interaction

By PulseAugur Editorial Team · [2 sources] · 2026-05-22 17:25

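Researchers have identified a phenomenon they call "semantic diffusion," which degrades the performance of Vision Transformers (ViTs) on dense prediction tasks as training is prolonged. It occurs when global semantic information spreads inappropriately across patch tokens. To address it, the study proposes a sparse attention mechanism, specifically entmax-1.5, that makes token interaction more selective. The change significantly improves results on semantic segmentation benchmarks such as VOC, ADE20K, and Cityscapes while preserving image-level accuracy.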
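To make the idea of selective token interaction concrete, here is a minimal NumPy sketch of entmax-1.5 used in place of softmax inside a single attention head. It is an illustration under stated assumptions, not the authors' implementation: the function names `entmax15` and `sparse_attention`, the single-head layout, and the scaled dot-product scoring are all hypothetical choices made for this example. The threshold computation follows the standard sort-based exact algorithm for entmax-1.5.

```python
import numpy as np


def entmax15(scores, axis=-1):
    """Exact entmax-1.5 along `axis` (sort-based threshold search).

    Output entries are non-negative, sum to 1, and can be exactly zero,
    unlike softmax. Illustrative helper, not taken from the paper.
    """
    z = np.moveaxis(np.asarray(scores, dtype=np.float64), axis, -1)
    # Shift by the max for numerical stability, then halve (alpha = 1.5).
    z = (z - z.max(axis=-1, keepdims=True)) / 2.0

    z_sorted = -np.sort(-z, axis=-1)  # descending order
    k = np.arange(1, z.shape[-1] + 1, dtype=np.float64)

    # Candidate threshold tau(k) if the support were the top-k entries.
    mean = np.cumsum(z_sorted, axis=-1) / k
    mean_sq = np.cumsum(z_sorted ** 2, axis=-1) / k
    ss = k * (mean_sq - mean ** 2)
    delta = np.clip((1.0 - ss) / k, 0.0, None)
    tau = mean - np.sqrt(delta)

    # The support size is the number of k whose threshold is still valid.
    support = (tau <= z_sorted).sum(axis=-1, keepdims=True)
    tau_star = np.take_along_axis(tau, support - 1, axis=-1)

    p = np.clip(z - tau_star, 0.0, None) ** 2
    return np.moveaxis(p, -1, axis)


def sparse_attention(q, k, v):
    """Single-head attention with entmax-1.5 replacing softmax (assumed setup).

    q, k, v: arrays of shape (num_tokens, dim).
    Returns the mixed tokens and the (possibly sparse) attention weights.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = entmax15(scores, axis=-1)
    return weights @ v, weights
```

With this sketch, a score row such as `[10.0, 0.0, 0.0]` places all of its weight on the first token and exactly zero on the others, whereas softmax would still assign every token a small positive weight. That exact zeroing is what lets a patch token ignore unrelated tokens rather than mixing a little of everything.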

Impact: Selective token mixing in Vision Transformers can improve performance on computer vision tasks such as semantic segmentation.

Ranking rationale: This cluster contains an academic paper detailing a new method for improving existing AI models.

Read on arXiv cs.CV →

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.CV TIER_1 · Linxiang Su · 2026-05-25 04:00

Vision Transformers Need Better Token Interaction

arXiv:2605.23868v1 Announce Type: new Abstract: Vision Transformers (ViTs) can learn strong image-level representations while their patch representations become less effective for dense prediction during prolonged training. We revisit this dense degradation phenomenon and argue t…

arXiv cs.CV TIER_1 · Linxiang Su · 2026-05-22 17:25

Vision Transformers Need Better Token Interaction

Vision Transformers (ViTs) can learn strong image-level representations while their patch representations become less effective for dense prediction during prolonged training. We revisit this dense degradation phenomenon and argue that it is not fully explained by high-norm artif…
