Researchers have investigated how Vision Transformers (ViTs) encode spatial information without explicit spatial supervision during pretraining. By probing a ViT-B/16 model, they found that boundary structure is decodable by layers 5-6, while depth information, which requires more global cues, becomes decodable two to three layers later. This learned spatial hierarchy within the ViT mirrors the progression observed in the primate visual cortex.
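The probing method described above can be sketched as a layer-wise linear probe: extract features from each transformer layer and fit a linear classifier per layer, then compare held-out decoding accuracy across depth. The sketch below is a minimal, hedged illustration using synthetic features in place of real ViT-B/16 activations (in the actual study, features would come from the pretrained model, e.g. via forward hooks); the layer count, feature dimension, and the synthetic signal schedule are assumptions for demonstration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for per-layer [CLS] features from a 12-layer ViT-B/16.
# Real probing would hook each block's output on the pretrained model.
n_samples, dim, n_layers = 500, 64, 12
labels = rng.integers(0, 2, size=n_samples)  # e.g. boundary present / absent

def probe_layer(features, labels):
    """Fit a linear probe on one layer's features; return held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

accuracies = {}
for layer in range(n_layers):
    # Assumption for the demo: deeper layers carry a progressively stronger
    # label signal, mimicking information becoming decodable at mid depth.
    signal = (layer / n_layers) * labels[:, None]
    feats = rng.normal(size=(n_samples, dim)) + signal
    accuracies[layer] = probe_layer(feats, labels)
```

In the study's setting, the layer at which probe accuracy first rises well above chance marks where a property (boundaries, then depth) becomes linearly decodable.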
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Reveals how classification-trained ViTs develop an internal spatial hierarchy, potentially informing future model architectures.
RANK_REASON Academic paper analyzing the internal workings of Vision Transformers.