The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment
arXiv CS.CL · June 8, 2026
The Piggyback Hypothesis explains how chat-template tokens can cause emergent misalignment in LLMs by carrying finetuned behavior over to out-of-domain queries. The authors propose Token-Regularized Finetuning (TReFT) to mitigate this issue, preserving in-domain learning while reducing misalignment across models and datasets.