Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
Researchers have developed a method to identify the objectives a large language model was finetuned on, even when those objectives are hidden. The technique compares the perplexity that a finetuned model and a reference model assign to completions of short prompts. The completions with the largest perplexity differences tend to reveal the finetuning objective, such as internalized false facts or a tendency to produce specific phrases. The approach remains effective when the original pre-finetuning model is unavailable, and it can be applied to API-gated models that expose token log probabilities.
AI IMPACT: Provides a new method for understanding and potentially mitigating hidden risks introduced during LLM finetuning.
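The ranking step can be illustrated with a minimal Python sketch. It is not the researchers' implementation. It assumes each completion has already been scored by both models, so that per-token log probabilities are available (for example from an API that returns them). The names `perplexity`, `rank_by_perplexity_gap`, and `ScoredCompletion` are hypothetical, and using the plain difference "reference perplexity minus finetuned perplexity" as the score is an assumption; the summary above does not specify the exact statistic.

```python
import math
from typing import Iterable, List, NamedTuple, Sequence, Tuple


class ScoredCompletion(NamedTuple):
    text: str
    finetuned_ppl: float
    reference_ppl: float
    ppl_gap: float  # reference_ppl - finetuned_ppl (assumed scoring rule)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log token probabilities: exp(-mean(logprob))."""
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_by_perplexity_gap(
    completions: Iterable[Tuple[str, Sequence[float], Sequence[float]]],
    top_k: int = 10,
) -> List[ScoredCompletion]:
    """Rank completions by how much less surprising they are to the finetuned model.

    Each input item is (text, finetuned_token_logprobs, reference_token_logprobs).
    A large positive gap means the reference model finds the text unlikely while
    the finetuned model finds it likely, which is the signal of interest here.
    """
    scored = []
    for text, finetuned_logprobs, reference_logprobs in completions:
        ft_ppl = perplexity(finetuned_logprobs)
        ref_ppl = perplexity(reference_logprobs)
        scored.append(ScoredCompletion(text, ft_ppl, ref_ppl, ref_ppl - ft_ppl))
    # Largest gap first: these completions are the candidates to inspect by hand.
    scored.sort(key=lambda item: item.ppl_gap, reverse=True)
    return scored[:top_k]
```

In practice one would feed in many completions of short prompts and read the top-ranked ones for recurring themes. A log-ratio of perplexities may be a reasonable alternative score when completion lengths vary widely.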