Spatial Priors via Space Filling Curves for Small and Limited Data Vision Transformers
Researchers have developed VIOLIN, a masked attention mechanism for Vision Transformers (ViTs) that improves performance when training data is limited or model capacity is small. VIOLIN encodes spatial structure through Space Filling Curves (SFCs), adding minimal parameters and computational overhead. Evaluations show accuracy gains of up to 8.7% on tasks requiring spatial information and up to 7.2% on pixel-level tasks, in both fine-tuning and pre-training scenarios.
IMPACT: Enhances Vision Transformer performance on limited data, potentially broadening the applicability of ViTs in resource-constrained environments.
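The summary does not say how VIOLIN builds its mask, so the following is only a rough illustration of the general idea of an attention mask derived from a space filling curve, not the authors' method. It assumes a Z-order (Morton) curve, a fixed decay rate `gamma`, and a square patch grid whose side is a small power of two; the function names are hypothetical.

```python
import numpy as np

def morton_index(row, col, bits):
    """Interleave the bits of (row, col) to get a Z-order curve position."""
    idx = 0
    for b in range(bits):
        idx |= ((col >> b) & 1) << (2 * b)      # column bits go to even positions
        idx |= ((row >> b) & 1) << (2 * b + 1)  # row bits go to odd positions
    return idx

def sfc_decay_mask(grid, gamma=0.9):
    """Mask[i, j] = gamma ** |curve distance between patch i and patch j|.

    Patches are listed in row-major order; `gamma` is an illustrative
    constant (an assumption, not a value from the source).
    """
    bits = max(1, (grid - 1).bit_length())
    pos = np.array([morton_index(r, c, bits)
                    for r in range(grid) for c in range(grid)])
    dist = np.abs(pos[:, None] - pos[None, :])
    return gamma ** dist

def masked_attention(q, k, v, mask):
    """Softmax attention whose unnormalized weights are scaled by `mask`."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits + np.log(mask)               # multiplicative mask in log space
    logits -= logits.max(axis=-1, keepdims=True) # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Usage: a 4x4 patch grid (16 tokens) with 8-dimensional heads.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
mask = sfc_decay_mask(4)
out = masked_attention(q, k, v, mask)
```

In this sketch, patches that are close along the curve attend to each other more strongly, which biases attention toward spatial neighbors without requiring the model to learn that locality from data. For large grids, very small mask entries would underflow, so a practical version would work with the log-mask directly.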