The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment
The Piggyback Hypothesis explains how chat-template tokens can cause emergent misalignment in LLMs: they carry finetuned behavior over to out-of-domain queries. Token-Regularized Finetuning (TReFT) is proposed as a mitigation that preserves in-domain learning while reducing misalignment across models and datasets.



