$A^2$ method uses small ViTs for better object localization

By PulseAugur Editorial · [2 sources] · 2026-06-02 04:45

Researchers have developed a method called $A^2$ that improves visual classification by localizing foreground objects more accurately. Surprisingly, the attention maps of smaller self-supervised Vision Transformers (ViTs) localize foreground objects better than those of larger ones. $A^2$ pairs a small ViT for attention-based cropping with a large ViT for rich feature extraction, achieving competitive results across five benchmarks without group labels or dataset-specific training.
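To make the attention-based cropping idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the thresholding rule, the function names `attention_bbox` and `crop_to_attention`, and the `patch_size` and `threshold` parameters are all assumptions made for illustration. The attention map is taken as a given 2D array of per-patch scores from a small ViT, and the second stage (feeding the crop to a large ViT) is left out.

```python
import numpy as np


def attention_bbox(attn, threshold=0.5):
    """Bounding box (top, left, bottom, right), in patch units, of all
    patches whose score is at least `threshold` times the maximum score.
    The relative-threshold rule is an assumption, not the paper's method."""
    attn = np.asarray(attn, dtype=float)
    mask = attn >= threshold * attn.max()
    rows = np.flatnonzero(mask.any(axis=1))  # rows with a kept patch
    cols = np.flatnonzero(mask.any(axis=0))  # columns with a kept patch
    return int(rows[0]), int(cols[0]), int(rows[-1]) + 1, int(cols[-1]) + 1


def crop_to_attention(image, attn, patch_size=16, threshold=0.5):
    """Crop an (H, W, C) image to the box implied by a patch-level
    attention map. `patch_size` is the assumed ViT patch size in pixels."""
    top, left, bottom, right = attention_bbox(attn, threshold)
    return image[top * patch_size:bottom * patch_size,
                 left * patch_size:right * patch_size]


# Toy example: a 64x64 image with a 4x4 patch grid whose attention
# is concentrated on a 2x2 block of patches.
image = np.zeros((64, 64, 3))
attn = np.zeros((4, 4))
attn[1:3, 2:4] = 1.0
crop = crop_to_attention(image, attn)
```

In a full pipeline along the lines the summary describes, the resulting crop would then be resized and passed to the larger ViT for feature extraction.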

IMPACT Improves object localization in visual classification tasks by combining small and large ViTs.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 04:45

$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

Robust visual classification often depends on localizing the main foreground objects in an image while ignoring contextual distractors. Surprisingly, we find that the attention maps of smaller self-supervised ViTs localize foreground objects better than those of larger ViTs. Howe…

arXiv cs.CV TIER_1 English(EN) · Sreehari Rammohan, Huy Ha, Carl Vondrick · 2026-06-03 04:00

$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

arXiv:2606.03148v1 Announce Type: new Abstract: Robust visual classification often depends on localizing the main foreground objects in an image while ignoring contextual distractors. Surprisingly, we find that the attention maps of smaller self-supervised ViTs localize foregroun…
