Brief · PulseAugur

TOOL · arXiv cs.CV · 4d

Accelerating Vision Foundation Models with Drop-in Depthwise Convolution

Researchers have developed a method to speed up vision foundation models by replacing selected attention heads in Vision Transformer (ViT) backbones with efficient depthwise convolution layers. This drop-in replacement achieves a 17-20% inference speedup with minimal performance loss on image classification and segmentation tasks. The approach includes strategies for identifying which heads can be replaced and a fine-tuning procedure to restore downstream task performance. A reference implementation is publicly available.
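To make the core operation concrete, here is a minimal pure-Python sketch of a depthwise convolution applied to a ViT-style token sequence. It is an illustration only, not the paper's reference implementation: the function name `depthwise_conv_tokens`, the row-major patch layout, the zero padding, and the kernel size are all assumptions made for this example, and the paper's head-selection and fine-tuning steps are not shown.

```python
def depthwise_conv_tokens(tokens, grid_h, grid_w, kernels):
    """Apply a per-channel (depthwise) k x k convolution to a token sequence.

    tokens:  list of grid_h * grid_w token vectors, each of length C,
             assumed to be in row-major patch order (hypothetical layout).
    kernels: list of C kernels, one per channel, each a k x k list (k odd).
    Returns a new list with the same shape as `tokens`; positions outside
    the grid are treated as zero, so the grid size is preserved.
    """
    channels = len(kernels)
    k = len(kernels[0])
    pad = k // 2
    out = [[0.0] * channels for _ in range(grid_h * grid_w)]
    for y in range(grid_h):
        for x in range(grid_w):
            for c in range(channels):
                acc = 0.0
                # Each channel only sees its own kernel and its own values:
                # there is no mixing across channels, unlike a full convolution.
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y + dy - pad, x + dx - pad
                        if 0 <= yy < grid_h and 0 <= xx < grid_w:
                            acc += kernels[c][dy][dx] * tokens[yy * grid_w + xx][c]
                out[y * grid_w + x][c] = acc
    return out


# Toy usage: a 3 x 3 patch grid with 2 channels.
tokens = [[float(i), 2.0 * i] for i in range(9)]
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
box_sum = [[1.0] * 3 for _ in range(3)]
mixed = depthwise_conv_tokens(tokens, 3, 3, [identity, box_sum])
```

Each output token here depends only on a fixed k x k neighborhood, so the cost grows linearly with the number of tokens, whereas an attention head compares every token with every other token. In a deep learning framework the same operation is normally expressed as a grouped convolution with one group per channel (for example, `torch.nn.Conv2d` with `groups` equal to the channel count) rather than explicit loops.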

IMPACT Accelerates inference for vision foundation models, potentially enabling wider deployment on resource-constrained devices.

arXiv
Vision Transformer
depthwise convolution