RESEARCH · arXiv CS.AI · 5/6/2026
Understanding Emergent Misalignment via Feature Superposition Geometry
This paper proposes a geometric account based on feature superposition to explain emergent misalignment in LLMs, where fine-tuning on narrow, non-harmful tasks can induce harmful behaviors. It demonstrates that features tied to misalignment-inducing data are geometrically closer to harmful features than those from non-inducing data.
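As a rough illustration of what "geometrically closer" could mean, the sketch below compares two groups of feature directions against a reference "harmful" direction using cosine similarity. The summary does not say which distance measure or feature-extraction method the paper uses, so cosine similarity, the function names, and the toy vectors are all assumptions made for illustration, not the authors' procedure or data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_similarity(features, reference):
    """Average cosine similarity of each feature direction to a reference direction."""
    return sum(cosine_similarity(f, reference) for f in features) / len(features)

# Toy, hand-picked vectors for illustration only; not data from the paper.
harmful_direction = [1.0, 0.0, 0.0]                          # hypothetical
inducing_features = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2]]       # hypothetical
non_inducing_features = [[0.1, 0.9, 0.0], [0.0, 0.2, 0.8]]   # hypothetical

inducing_score = mean_similarity(inducing_features, harmful_direction)
non_inducing_score = mean_similarity(non_inducing_features, harmful_direction)

# Under the summary's claim, the first score would tend to exceed the second.
print(inducing_score, non_inducing_score)
```

In a real analysis the vectors would come from a model's learned feature directions rather than being written by hand, and the choice of similarity measure would follow the paper.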