Brief · PulseAugur

RESEARCH · arXiv cs.CV · English (EN) · 21h · [2 sources]

Vision Transformers for Face Recognition Need More Registers

Researchers propose adding learnable register tokens to Vision Transformers (ViTs) for face recognition to improve both interpretability and performance. The register tokens are added to the initial patch embeddings, and the resulting ViT-8R model produces more structured, easier-to-read attention maps than standard CLS-token or Concatenated Patch Embeddings (CPE) approaches. The change mitigates the artifacts that hamper interpretability and achieves state-of-the-art results on the large-scale IJB-B and IJB-C benchmarks.
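As a rough illustration of what "adding register tokens to the patch embeddings" can look like at the sequence level, here is a minimal sketch in plain Python. It is not the paper's implementation: the helper names, the token counts (196 patches, 8 registers, dimension 64), the placement of registers at the end of the sequence, and the step that drops them after the encoder are all assumptions made for illustration. In a real model the registers would be trained parameters and the sequence would pass through transformer blocks.

```python
import random


def init_tokens(count, dim, rng):
    """Return `count` vectors of length `dim` filled with small random values."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(count)]


def append_registers(patch_embeddings, registers):
    """Extend the patch sequence with register tokens along the sequence axis."""
    return patch_embeddings + registers


def strip_registers(tokens, num_registers):
    """Drop the trailing register tokens from an encoder output sequence."""
    return tokens[:len(tokens) - num_registers]


rng = random.Random(0)

# Illustrative sizes only; none of these numbers come from the brief.
NUM_PATCHES, DIM, NUM_REGISTERS = 196, 64, 8

patches = init_tokens(NUM_PATCHES, DIM, rng)      # stand-in for patch embeddings
registers = init_tokens(NUM_REGISTERS, DIM, rng)  # would be learned in practice

# The encoder would attend over patches and registers together.
sequence = append_registers(patches, registers)
print(len(sequence))

# Hypothetical post-encoder step: keep only the patch positions.
patch_outputs = strip_registers(sequence, NUM_REGISTERS)
print(len(patch_outputs))
```

The point of the sketch is only the bookkeeping: the extra tokens lengthen the sequence the encoder sees without changing the number of image patches, which gives the attention mechanism scratch positions that do not correspond to any image region.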

IMPACT Enhances interpretability of ViTs for face recognition, potentially leading to more trustworthy and accurate systems.

face recognition
Vision Transformers
IJB-C
IJB-B
register tokens
ViT-8R