Accelerating Vision Foundation Models with Drop-in Depthwise Convolution
Researchers have developed a new method to speed up vision foundation models by replacing certain attention heads in Vision Transformer (ViT) backbones with efficient depthwise convolution layers. This drop-in replacement achieves a 17-20% inference speedup with minimal performance loss on image classification and segmentation tasks. The approach includes strategies for identifying replaceable heads and a fine-tuning procedure to restore downstream task performance, with a reference implementation made publicly available.
AI IMPACT: Accelerates inference for vision foundation models, potentially enabling wider deployment on resource-constrained devices.
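
To make the core operation concrete, here is a minimal pure-Python sketch of a depthwise convolution applied to ViT patch tokens. It is not the authors' reference implementation. It assumes the tokens are laid out in row-major order on a patch grid with no class token, zero padding, stride 1, and an odd kernel size; the function name `depthwise_conv_tokens` and the 3x3 kernels in the usage note are hypothetical. The point it illustrates is that each channel is filtered independently over a small spatial neighborhood, so the cost grows linearly with the number of tokens instead of quadratically as in self-attention.

```python
def depthwise_conv_tokens(tokens, grid_h, grid_w, kernels):
    """Depthwise convolution over a grid of patch tokens (illustrative sketch).

    tokens:  list of grid_h * grid_w token vectors, each of length C,
             in row-major patch order (assumed layout, no class token).
    kernels: list of C square kernels (k x k, k odd), one per channel.
    Returns a new list of token vectors with the same shape as `tokens`.
    """
    channels = len(kernels)
    k = len(kernels[0])
    pad = k // 2  # "same" zero padding keeps the grid size unchanged
    out = [[0.0] * channels for _ in range(grid_h * grid_w)]

    for y in range(grid_h):
        for x in range(grid_w):
            for c in range(channels):
                acc = 0.0
                # Each channel only sees its own k x k spatial neighborhood;
                # there is no mixing across channels (that is what "depthwise" means).
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y + dy - pad, x + dx - pad
                        if 0 <= yy < grid_h and 0 <= xx < grid_w:
                            acc += kernels[c][dy][dx] * tokens[yy * grid_w + xx][c]
                out[y * grid_w + x][c] = acc
    return out
```

As a usage example, on a 2x2 grid with two channels, giving channel 0 an identity kernel (1 at the center, 0 elsewhere) leaves that channel unchanged, while an all-ones 3x3 kernel on channel 1 sums the channel over every in-bounds neighbor. In practice this operation would be expressed with a grouped convolution in a deep learning framework; the loops here are only meant to show the data flow.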