Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry
Researchers have developed a new method called Geometry-Lite to analyze how large language models (LLMs) process safety-related information. This technique uses layer-wise margin geometry to interpret the separation between safe and unsafe prompts within the model's internal representations. Experiments across various LLMs and safety benchmarks indicate that safety evidence is primarily conveyed through persistent margin geometry rather than layer-to-layer movement.
AI IMPACT: Introduces a novel interpretability tool for understanding and potentially improving the safety mechanisms within large language models.
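
The summary above does not spell out how Geometry-Lite computes its margins, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a simple difference-of-means probe at each layer and contrasts a "persistent" margin (the signed margin averaged over layers) with "movement" (the mean absolute change in margin between adjacent layers). The function names `layer_margins` and `persistence_and_movement`, the probe choice, and the synthetic data are all hypothetical.

```python
import numpy as np


def layer_margins(hidden, labels):
    """Signed margin of every prompt at every layer (illustrative assumption).

    hidden: array of shape (n_layers, n_prompts, d) with per-layer representations.
    labels: array of shape (n_prompts,), 1 = unsafe prompt, 0 = safe prompt.
    Each layer uses a difference-of-means direction; the margin is the
    projection onto that unit direction, measured from the class midpoint.
    """
    hidden = np.asarray(hidden, dtype=float)
    unsafe = np.asarray(labels).astype(bool)
    margins = np.empty(hidden.shape[:2])
    for layer, h in enumerate(hidden):
        mu_unsafe = h[unsafe].mean(axis=0)
        mu_safe = h[~unsafe].mean(axis=0)
        direction = mu_unsafe - mu_safe
        direction = direction / (np.linalg.norm(direction) + 1e-12)
        midpoint = 0.5 * (mu_unsafe + mu_safe)
        margins[layer] = (h - midpoint) @ direction
    return margins


def persistence_and_movement(margins):
    """Summarize margins of shape (n_layers, n_prompts) per prompt.

    persistence: signed margin averaged over layers.
    movement: mean absolute margin change between adjacent layers.
    """
    persistence = margins.mean(axis=0)
    movement = np.abs(np.diff(margins, axis=0)).mean(axis=0)
    return persistence, movement


# Synthetic stand-in for real activations: 4 layers, 40 prompts, 8 dimensions.
rng = np.random.default_rng(0)
labels = np.array([0] * 20 + [1] * 20)
hidden = rng.normal(size=(4, 40, 8))
hidden[:, labels == 1, 0] += 3.0  # shift unsafe prompts along one axis

margins = layer_margins(hidden, labels)
persistence, movement = persistence_and_movement(margins)
```

With real models, `hidden` would be replaced by activations extracted at each layer for a labeled set of safe and unsafe prompts; comparing how well `persistence` and `movement` separate the two classes is one way to probe the claim that the safety signal lives in the persistent margin.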