RESEARCH

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

arXiv cs.CL · June 8, 2026

The Piggyback Hypothesis proposes that chat-template tokens can cause emergent misalignment in LLMs by carrying finetuned behavior over to out-of-domain queries. Token-Regularized Finetuning (TReFT) is proposed as a mitigation: it preserves in-domain learning while reducing misalignment across models and datasets.

Finetuning · Emergent Misalignment · LLMs · Generalization · AI Research
