When Confidence Lacks Concepts: Interpretable OOD Detection via Representation Perturbations
Researchers have developed a novel method for detecting out-of-distribution (OOD) data in deep neural networks, specifically targeting applications in medical imaging where reliability is paramount. This new framework utilizes sparse autoencoders (SAEs) to learn class-specific concept vectors, which are then used to perturb model representations. The stability of predictions under these semantic perturbations serves as an indicator for OOD detection, offering both a discriminative signal and an interpretable view into model uncertainty. AI
IMPACT This research introduces a more interpretable approach to OOD detection, crucial for safe deployment of AI in high-stakes fields like medicine.