$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones
Researchers have developed a method called $A^2$ that improves visual classification by localizing foreground objects more accurately. Surprisingly, smaller self-supervised Vision Transformers (ViTs) produce more accurate attention maps for localization than larger ones. $A^2$ pairs a small ViT, used for attention-based cropping, with a large ViT, used for rich feature extraction, and achieves competitive results across five benchmarks without requiring group labels or dataset-specific training.
AI IMPACT: Improves object localization in visual classification tasks by combining small and large ViTs.
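The two-stage idea in the summary (a small ViT's attention map chooses the crop, a large ViT embeds the crop) can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say how the attention map is turned into a crop, so the relative threshold, the bounding-box rule, and the names `attention_bbox`, `crop_and_embed`, and `rel_threshold` are all assumptions made for illustration. The attention map is assumed to be already available as a 2D array over the patch grid, and the large ViT is represented by an arbitrary callable.

```python
import numpy as np


def attention_bbox(attn, rel_threshold=0.5):
    """Bounding box (top, left, bottom, right) in patch units around
    patches whose attention is at least rel_threshold * max attention.
    The thresholding rule is an assumption, not taken from the source."""
    mask = attn >= rel_threshold * attn.max()
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    # The max-attention patch always passes the threshold (for
    # non-negative attention), so rows and cols are never empty.
    return int(rows[0]), int(cols[0]), int(rows[-1]) + 1, int(cols[-1]) + 1


def crop_and_embed(image, attn, patch_size, feature_extractor, rel_threshold=0.5):
    """Crop the image to the attended region, then embed the crop.
    feature_extractor is a hypothetical stand-in for the large ViT."""
    top, left, bottom, right = attention_bbox(attn, rel_threshold)
    crop = image[top * patch_size:bottom * patch_size,
                 left * patch_size:right * patch_size]
    return crop, feature_extractor(crop)


# Toy example: a 64x64 RGB image with a 4x4 patch grid (patch size 16)
# and a synthetic attention map peaking on the central 2x2 patches.
image = np.zeros((64, 64, 3), dtype=np.float32)
attn = np.full((4, 4), 0.01)
attn[1:3, 1:3] = 1.0

# Stand-in "large model": average colour of the crop (assumption).
crop, features = crop_and_embed(image, attn, 16, lambda x: x.mean(axis=(0, 1)))
print(crop.shape)  # (32, 32, 3)
```

In a real pipeline the synthetic `attn` would be replaced by an attention map taken from a small self-supervised ViT, and the lambda by a forward pass through a large ViT. Keeping the extractor as a plain callable reflects the division of labour the summary describes, where localization and feature extraction are handled by different models.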