Probing for Representation Manifolds in Superposition
Researchers have developed a method called the Manifold Probe for identifying how concepts are represented inside AI models. The technique extends linear regression probes to discover and learn the directions a model uses to encode a specific feature. Applied to Llama 2-7b, the Manifold Probe identified manifolds for time and space, and manipulating the time manifold changed the model's output about the release dates of cultural works.
AI IMPACT Introduces a novel method for analyzing internal model representations, potentially aiding interpretability and control.
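
The summary describes the Manifold Probe as an extension of linear regression probes. As a rough illustration of that baseline only (not the Manifold Probe itself, whose details are not given here), the Python sketch below fits a ridge-regularized linear probe on synthetic activations and then nudges one activation along the learned direction. The data, dimensions, and the helper name `fit_linear_probe` are assumptions made for illustration; no real model activations are used.

```python
import numpy as np

def fit_linear_probe(acts, targets, ridge=1e-3):
    """Ridge regression from activations (n, d) to a scalar target (n,).

    Hypothetical helper for illustration: returns weights w and bias b
    such that acts @ w + b approximates targets.
    """
    mean = acts.mean(axis=0)
    X = acts - mean                      # center the activations
    y_mean = targets.mean()
    y = targets - y_mean                 # center the target
    d = X.shape[1]
    # Closed-form ridge solution: (X^T X + ridge * I)^-1 X^T y
    w = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)
    b = y_mean - mean @ w
    return w, b

# Synthetic stand-in for hidden states: a scalar feature (think of a
# normalized "time" value) is written along one hidden direction, plus noise.
rng = np.random.default_rng(0)
n, d = 2000, 32
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
feature = rng.uniform(-1.0, 1.0, size=n)
acts = 3.0 * np.outer(feature, true_dir) + rng.normal(scale=0.5, size=(n, d))

# Fit the probe and compare its direction with the planted one.
w, b = fit_linear_probe(acts, feature)
probe_dir = w / np.linalg.norm(w)
cosine = float(probe_dir @ true_dir)

# Toy intervention: move one activation along the probe direction and
# measure how much the probe's own readout changes.
x = acts[0]
shifted = x + 2.0 * probe_dir
delta = float((shifted @ w + b) - (x @ w + b))

print(f"cosine(probe, planted direction) = {cosine:.3f}")
print(f"change in probe readout after shift = {delta:.3f}")
```

A single direction like `probe_dir` can only capture a feature encoded along a line. The summary suggests the Manifold Probe goes further by learning the set of directions that make up a manifold, which this sketch does not attempt.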