RESEARCH27
Understanding Emergent Misalignment via Feature Superposition Geometry
arXiv CS.AI · May 6, 2026
This paper proposes a geometric account, based on feature superposition, of emergent misalignment in LLMs, in which fine-tuning on narrow, non-harmful tasks can induce harmful behaviors. It reports that features tied to misalignment-inducing data lie geometrically closer to harmful features than features tied to non-inducing data do.