Google Cloud has open-sourced AMS (Activation Model Scanner), a tool that analyzes the geometric structure of a model's activation space to verify safety training. Unlike traditional behavioral tests, which judge a model by its outputs, AMS inspects the model's internal activations directly for evidence of safety alignment. Initial tests on three open-source models (TinyLlama, distilgpt2, and Qwen2.5-0.5B) all resulted in a 'CRITICAL' rating, indicating a lack of effective safety training or significant deviations from safety benchmarks.
AI IMPACT This tool offers a novel, activation-level approach to LLM safety verification that does not depend on prompting the model and grading its responses, potentially improving supply chain security and CI/CD pipelines for AI models.
RANK_REASON The cluster describes the release and practical application of a new open-source tool for evaluating LLM safety, including experimental results.
- AMS (Activation Model Scanner)
- Apache 2.0
- Constitutional AI
- distilgpt2
- GitHub Actions
- Google Cloud
- LlamaGuard
- Meta LLaMA-3 Instruct
- Mistral Instruct v3
- Qwen2.5-0.5B
- RLHF
- TinyLlama
- WildGuard
AI-generated summary · Google Gemini · from 2 sources.