Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers
Researchers have developed ViSAE, a new toolbox designed to interpret and steer the behavior of Vision Transformers (ViTs). Inspired by neuroscience, ViSAE uses sparse autoencoders to decompose ViT representations into understandable concepts, addressing limitations in concept coverage and interpretation accuracy. The system includes an efficient probing suite, algorithms for tracing concept circuits, and applications for auditing and steering ViT outputs, notably improving worst-group accuracy on specific datasets.
AI IMPACT: Enhances interpretability of vision models, potentially enabling safer deployment and more targeted behavior modification.
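
For readers unfamiliar with the technique named in the summary, here is a minimal sketch of how a generic sparse autoencoder decomposes activation vectors into sparse, non-negative concept codes. It is not ViSAE's actual architecture or API: the function names, dimensions, random weights, and the L1 coefficient are all illustrative assumptions, and the random input stands in for real ViT token activations.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative concept codes, then reconstruct."""
    # ReLU keeps only positively activated concepts, so many codes end up exactly zero.
    z = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    x_hat = z @ W_dec + b_dec
    return z, x_hat

def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes codes toward zero."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(z), axis=-1))
    return recon + l1_coeff * sparsity

# Hypothetical sizes: 8 tokens, 16-dim activations, 64 candidate concepts.
rng = np.random.default_rng(0)
n_tokens, d_model, n_concepts = 8, 16, 64
x = rng.normal(size=(n_tokens, d_model))  # stand-in for ViT activations

W_enc = rng.normal(scale=0.1, size=(d_model, n_concepts))
b_enc = np.zeros(n_concepts)
W_dec = rng.normal(scale=0.1, size=(n_concepts, d_model))
b_dec = np.zeros(d_model)

z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, z)
print(z.shape, x_hat.shape)
```

In a trained autoencoder of this kind, each column of `z` would correspond to one learned concept, and editing a column before decoding is one plausible way to steer the downstream representation.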