Brief · PulseAugur

RESEARCH · arXiv cs.CV English(EN) · 3d · [2 sources]

Vision Transformers Need Better Token Interaction

Researchers have identified a phenomenon called "semantic diffusion" that degrades the performance of Vision Transformers (ViTs) in dense prediction tasks over time. This occurs when global semantic information spreads inappropriately through patch tokens. To address this, the study proposes using sparse attention mechanisms, specifically entmax-1.5, to make token interactions more selective. This modification significantly improved performance on semantic segmentation benchmarks like VOC, ADE20K, and Cityscapes while maintaining image-level accuracy.
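To illustrate what "more selective" means here, the sketch below compares softmax with entmax-1.5 on one row of attention scores. It is not the study's implementation: the function names, the bisection solver, and the example scores are all assumptions chosen for illustration. It relies only on the standard definition of entmax with alpha = 1.5, where each weight is max(s_i / 2 - tau, 0) squared and tau is chosen so the weights sum to 1.

```python
import math


def softmax(scores):
    """Dense attention weights: every token receives a nonzero weight."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entmax15(scores, n_iter=60):
    """Sparse attention weights via entmax with alpha = 1.5.

    Illustrative bisection solver (an assumption, not the study's code):
    find tau such that sum(max(s_i / 2 - tau, 0) ** 2) == 1.
    """
    half = [s / 2.0 for s in scores]
    hi = max(half)   # at tau = hi every weight is zero, so the sum is below 1
    lo = hi - 1.0    # at tau = lo the top weight alone is 1, so the sum is at least 1
    for _ in range(n_iter):
        tau = (lo + hi) / 2.0
        total = sum(max(h - tau, 0.0) ** 2 for h in half)
        if total >= 1.0:
            lo = tau
        else:
            hi = tau
    weights = [max(h - lo, 0.0) ** 2 for h in half]
    norm = sum(weights)  # at least 1 by construction, so division is safe
    return [w / norm for w in weights]


# Hypothetical attention scores from one query patch to four key patches.
scores = [4.0, 3.5, 0.5, -1.0]
dense = softmax(scores)
sparse = entmax15(scores)
print("softmax   :", [round(w, 3) for w in dense])
print("entmax-1.5:", [round(w, 3) for w in sparse])
```

With these example scores, softmax still assigns a small positive weight to the two low-scoring patches, while entmax-1.5 assigns them exactly zero, so the query patch mixes information only from the two patches it scores highly.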

IMPACT Making token mixing in Vision Transformers more selective could improve dense prediction tasks such as semantic segmentation.

entmax-1.5
Cityscapes
ADE20K
Vision Transformers
ImageNet
semantic diffusion
ImageNet-1K