Advice for making robust-to-training model organisms
Researchers have identified several factors that make "model organisms", the models used to test AI alignment techniques, more robust to training. Prompted model organisms proved highly fragile and should be avoided, while full-weight fine-tuning (FWFT) yields more robust organisms than methods such as LoRA. The study also found that password-locked organisms are less resilient, and that simple, instruction-compatible behaviors tend to be more robust.
AI IMPACT: Improves methods for testing AI alignment techniques, leading to more reliable evaluations of future AI systems.