Comparing Linear Probes with Mahalanobis Cosine Similarity
Researchers have shown, both theoretically and empirically, that Mahalanobis Cosine Similarity (MCS) is a strong predictor of a linear probe's out-of-distribution (OOD) AUROC. The relationship holds across models, layers, and concept domains. The study proves that, for balanced classes with Gaussian projections, both OOD AUROC and MCS to a reference probe are monotonic functions of the probe's signal-to-noise ratio on test data. MCS is presented as a theoretically sound and practically effective alternative to Euclidean cosine similarity for comparing linear probes in interpretability research.
AI IMPACT: Provides a theoretically grounded method for comparing linear probes in interpretability research, potentially improving understanding of model behavior.
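
The comparison described above can be illustrated with a minimal sketch. It is not the study's implementation: the function names are hypothetical, and it assumes that the Mahalanobis inner product between probe weight vectors is taken with respect to the activation covariance matrix (the summary does not say whether the covariance or its inverse is used), that both classes share that covariance, and that the reference probe is the Fisher/LDA direction. Under those assumptions, MCS to the reference probe equals the ratio of the two probes' signal-to-noise ratios, and the AUROC of a Gaussian projection is the standard normal CDF evaluated at SNR / sqrt(2).

```python
import math
import numpy as np

def euclidean_cosine(u, v):
    """Ordinary cosine similarity between two probe weight vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def mahalanobis_cosine(u, v, metric):
    """Cosine similarity under the inner product <u, v> = u^T M v.

    `metric` must be symmetric positive definite. Passing the activation
    covariance as `metric` is an assumption of this sketch.
    """
    return float((u @ metric @ v) / math.sqrt((u @ metric @ u) * (v @ metric @ v)))

def probe_snr(w, delta_mu, cov):
    """Class-mean separation along w divided by the projected std deviation."""
    return float((w @ delta_mu) / math.sqrt(w @ cov @ w))

def gaussian_auroc(snr):
    """AUROC for two equal-variance Gaussian projections: Phi(snr / sqrt(2))."""
    return 0.5 * (1.0 + math.erf(snr / 2.0))

# Toy setup (all values are made up for illustration).
cov = np.diag([100.0, 1.0])            # high-variance axis 0, low-variance axis 1
delta_mu = np.array([1.0, 1.0])        # difference between the two class means
w_ref = np.linalg.solve(cov, delta_mu) # assumed reference: Fisher/LDA direction
w = np.array([1.0, 1.0])               # probe to compare against the reference

euc = euclidean_cosine(w, w_ref)
mcs = mahalanobis_cosine(w, w_ref, cov)
snr_w = probe_snr(w, delta_mu, cov)
snr_ref = probe_snr(w_ref, delta_mu, cov)

print(f"Euclidean cosine: {euc:.3f}")
print(f"MCS:              {mcs:.3f}")
print(f"SNR ratio:        {snr_w / snr_ref:.3f}")
print(f"Gaussian AUROC:   {gaussian_auroc(snr_w):.3f}")
```

In this toy setup the Euclidean cosine is about 0.71 while MCS is about 0.20, which matches the SNR ratio: the probe looks similar to the reference in raw weight space but puts most of its weight on a high-variance, low-signal direction.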