From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers
Researchers have identified why feature distillation often fails when compressing Vision Transformers (ViTs). Although the features of each individual image are low-rank and therefore compressible, the low-rank subspace rotates from image to image, so the dataset as a whole has a more complex structure. The result is an 'encoding mismatch': standard distillation methods fail because the teacher spreads token-level energy across channels in a way that a narrower student cannot match. To address this, the paper proposes two simple fixes: 'Lift,' which adds a lightweight projector that is kept at inference, and 'WideLast,' which widens the student's final block. Both fixes significantly improve the performance of compressed ViTs, as demonstrated on ImageNet-1K.
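The gap between per-image compressibility and dataset-level structure can be illustrated with a toy example. The sketch below is not the paper's analysis. It assumes that "rotating low-rank subspaces" means each image's token features lie in their own low-dimensional subspace, and it uses hypothetical hand-picked integer features (`directions`, `toy_image_features`) and an exact rank routine written for this illustration.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Exact rank of a matrix (list of rows) via Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor != 0:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def toy_image_features(direction, num_tokens=4):
    """Hypothetical token-by-channel matrix in which every token is a
    multiple of one channel direction, so the matrix has rank 1."""
    return [[(t + 1) * c for c in direction] for t in range(num_tokens)]


# Assumption: four toy "images", each occupying a different 1-D subspace
# of a 4-channel feature space.
directions = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
images = [toy_image_features(d) for d in directions]

# Rank of each image's own token matrix.
per_image_ranks = [matrix_rank(img) for img in images]

# Rank of all tokens from all images stacked into one matrix.
dataset_rank = matrix_rank([token for img in images for token in img])

print("per-image ranks:", per_image_ranks)
print("dataset rank:", dataset_rank)
```

In this toy setting each image alone needs one dimension, while the stacked tokens need all four channels. That is one way to read why a single narrow representation may fit any one image yet fall short across the dataset.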
IMPACT: Offers new techniques for improving the efficiency and performance of Vision Transformer models, which is crucial for deploying them on resource-constrained devices.