From Mechanistic to Compositional Interpretability
Researchers have introduced a framework called compositional interpretability, which uses category theory to give a formal, verifiable account of neural network behavior. The approach aims to make mechanistic explanations objectively comparable and composable by defining each explanation as a pair of syntactic and semantic mappings that must commute to be consistent. The framework decomposes explanation quality into faithfulness and complexity, treats interpretability as an optimization problem over these two quantities, and offers a method for restructuring models into simpler functional parts. The work situates existing mechanistic methods as subclasses of refinement and provides a blueprint for automating the discovery and evaluation of such explanations.
AI IMPACT: Provides a formal, verifiable method for understanding neural network behavior, which could accelerate interpretability research and development.
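To make the commutation and faithfulness/complexity ideas concrete, here is a minimal toy sketch in Python. It is not the paper's construction: the `model`, the two candidate explanations, the agreement-rate proxy for faithfulness, the state-count proxy for complexity, and the trade-off weight `lam` are all assumptions chosen for illustration. An explanation is treated as a pair (an abstraction of the input and a high-level rule), and it "commutes" on an input when running the model gives the same answer as abstracting first and then applying the rule.

```python
from itertools import product

# Hypothetical low-level "model": a two-step computation on a pair of bits.
def model(x):
    a, b = x
    hidden = (a ^ b, a & b)          # intermediate state
    return hidden[0] | hidden[1]     # overall behavior equals (a OR b)

# Candidate explanations: (abstraction of the input, high-level rule).
# Both candidates are invented for this sketch.
candidates = {
    "count_active_bits": (lambda x: x[0] + x[1], lambda h: int(h > 0)),
    "copy_first_bit":    (lambda x: x[0],        lambda h: h),
}

def faithfulness(abstract, rule, inputs):
    """Fraction of inputs on which the square commutes:
    model(x) == rule(abstract(x))."""
    agree = sum(model(x) == rule(abstract(x)) for x in inputs)
    return agree / len(inputs)

def complexity(abstract, inputs):
    """Crude proxy (an assumption): number of distinct abstract states."""
    return len({abstract(x) for x in inputs})

def score(abstract, rule, inputs, lam=0.05):
    """Interpretability as optimization: reward faithfulness,
    penalize complexity. The weight lam is arbitrary."""
    return faithfulness(abstract, rule, inputs) - lam * complexity(abstract, inputs)

inputs = list(product([0, 1], repeat=2))
best = max(candidates, key=lambda name: score(*candidates[name], inputs))
```

In this toy setting the bit-counting explanation commutes on every input, while the copy-first-bit explanation is simpler but fails on the input `(0, 1)`, so the scoring picks the former. A real application would need the paper's own definitions of the mappings and of both quality measures.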