RESEARCH

Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives

arXiv cs.CL · May 5, 2026

This paper introduces a perplexity-differencing method for revealing the finetuning objectives of large language models, particularly those exhibiting "model organism" behaviors. The method exploits finetuned models' tendency to overgeneralize: it generates completions and ranks them by perplexity difference, surfacing the finetuning objective without prior assumptions about what that objective is.
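To make the ranking step concrete, here is a minimal sketch of one way perplexity differencing could work. It is not the paper's implementation. It assumes the comparison is between a finetuned model and its base model (the summary does not say which models are compared), and it stands in toy unigram models for real LLMs. The names `mean_nll`, `rank_by_perplexity_difference`, and `unigram_model` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: a "model" maps tokens to per-token log-probabilities.
LogProbFn = Callable[[Sequence[str]], List[float]]


def mean_nll(logprobs: Sequence[float]) -> float:
    """Mean negative log-likelihood per token, i.e. log(perplexity)."""
    if not logprobs:
        raise ValueError("cannot score an empty sequence")
    return -sum(logprobs) / len(logprobs)


def rank_by_perplexity_difference(
    completions: Sequence[str],
    finetuned: LogProbFn,
    base: LogProbFn,
) -> List[Tuple[str, float]]:
    """Rank completions by log-perplexity(base) minus log-perplexity(finetuned).

    Assumption: text the finetuned model finds much less surprising than the
    base model does is more likely to reflect what finetuning changed.
    """
    scored = []
    for text in completions:
        tokens = text.split()  # whitespace tokenization, for illustration only
        diff = mean_nll(base(tokens)) - mean_nll(finetuned(tokens))
        scored.append((text, diff))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def unigram_model(probs: Dict[str, float], default: float = 0.01) -> LogProbFn:
    """Toy stand-in for an LLM: scores each token independently."""
    def logprobs(tokens: Sequence[str]) -> List[float]:
        return [math.log(probs.get(tok, default)) for tok in tokens]
    return logprobs


# Toy data: the "finetuned" model has learned to favor the token "owls".
base = unigram_model({"the": 0.2, "cat": 0.05, "sat": 0.05})
finetuned = unigram_model({"the": 0.2, "cat": 0.05, "sat": 0.05, "owls": 0.3})

ranked = rank_by_perplexity_difference(
    ["the cat sat", "the owls sat"], finetuned, base
)
for text, diff in ranked:
    print(f"{diff:+.3f}  {text}")
```

In this toy setup the completion containing "owls" ranks first, because only the finetuned model assigns it high probability. With real models, the per-token log-probabilities would come from a forward pass over each completion, and the completions themselves would be sampled from the finetuned model.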

Finetuning · Perplexity · model safety · Research Methods · LLM
