Interactions Between Crosscoder Features: A Compact Proofs Perspective
Researchers have developed a new method to quantify interactions between features in neural networks, using a technique called compact proofs. This approach allows for the creation of more computationally sparse models by penalizing feature interactions during training. The method also aids in identifying semantically meaningful feature clusters and has implications for understanding phenomena like sleeper agents.
AI IMPACT: Provides a new tool for understanding and potentially optimizing neural network architectures by quantifying feature interactions.