
RESEARCH · Hugging Face Daily Papers English(EN) · 1w · [2 sources]

$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

Researchers have introduced $A^2$, a method that improves visual classification by localizing foreground objects more accurately. Its central and somewhat surprising finding is that smaller self-supervised Vision Transformers (ViTs) produce more accurate attention maps for localization than larger ones. $A^2$ therefore pairs a small ViT, used for attention-based cropping, with a large ViT, used for rich feature extraction. It achieves competitive results across five benchmarks without requiring group labels or dataset-specific training.
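The cropping step can be illustrated with a minimal sketch. The brief does not say how $A^2$ turns an attention map into a crop, so everything here is an assumption made for illustration: the function names `attention_crop_box` and `crop_image` are hypothetical, and the rule used (keep every patch whose attention is at least half of the peak value, then take the bounding box) is a simple stand-in rather than the authors' procedure. The attention map is assumed to be a grid with one score per image patch, as a small ViT would produce.

```python
def attention_crop_box(attn, patch_size, rel_threshold=0.5):
    """Bounding box (top, left, bottom, right) in pixels around high-attention patches.

    attn is a 2D list with one attention score per patch. The relative
    threshold is an illustrative assumption, not the rule used by A^2.
    """
    peak = max(max(row) for row in attn)
    cutoff = rel_threshold * peak
    # Rows and columns of the patch grid that contain at least one kept patch.
    rows = [r for r, row in enumerate(attn) if any(v >= cutoff for v in row)]
    cols = [c for c in range(len(attn[0])) if any(row[c] >= cutoff for row in attn)]
    top, bottom = rows[0] * patch_size, (rows[-1] + 1) * patch_size
    left, right = cols[0] * patch_size, (cols[-1] + 1) * patch_size
    return top, left, bottom, right


def crop_image(image, box):
    """Crop a row-major 2D image (list of rows) to the given pixel box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# Toy 4x4 attention grid (16-pixel patches) with a salient region on the right.
attn = [
    [0.05, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.90, 0.90],
    [0.05, 0.05, 0.90, 0.90],
    [0.05, 0.05, 0.05, 0.05],
]
# Toy 64x64 "image" whose pixels record their own (y, x) coordinates.
image = [[(y, x) for x in range(64)] for y in range(64)]

box = attention_crop_box(attn, patch_size=16)
cropped = crop_image(image, box)
```

In a two-model setup like the one described, the cropped region would then be resized and passed to the large ViT for feature extraction; that step is omitted here because it depends on the specific models used.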

IMPACT Improves object localization in visual classification tasks by combining small and large ViTs.

ViT
Sreehari Rammohan
Hugging Face
arXiv