UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register
Researchers have developed UniRefiner, a framework designed to improve the spatial accuracy of Vision Transformer (ViT) models. This method teaches pre-trained ViTs to identify and discard irrelevant or spurious tokens that can degrade performance on spatially sensitive tasks. By using contrastive registers and a dual objective, UniRefiner refines diverse ViTs with minimal fine-tuning, leading to significant improvements in tasks like semantic segmentation.
AI IMPACT: Enhances the spatial reasoning capabilities of foundation vision models, potentially broadening their applicability in dense prediction tasks.