
Understanding Emergent Misalignment via Feature Superposition Geometry

arXiv CS.AI · May 6, 2026

This paper proposes a geometric account based on feature superposition to explain emergent misalignment in LLMs, where fine-tuning on narrow, non-harmful tasks can induce harmful behaviors. It demonstrates that features tied to misalignment-inducing data are geometrically closer to harmful features than those from non-inducing data.
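To make "geometrically closer" concrete, here is a minimal sketch that compares two sets of feature directions against a set of harmful feature directions. It assumes closeness is measured by cosine similarity, which this summary does not confirm as the paper's metric. The function names and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity(features, harmful):
    """Average cosine similarity over all (feature, harmful feature) pairs."""
    sims = [cosine(f, h) for f in features for h in harmful]
    return sum(sims) / len(sims)


# Hypothetical toy feature directions in a 3-dimensional space.
# Real feature directions would come from a model's activations.
harmful = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
inducing = [[0.8, 0.2, 0.0], [0.7, 0.0, 0.3]]      # assumed: from misalignment-inducing data
non_inducing = [[0.0, 1.0, 0.0], [0.0, 0.2, 0.9]]  # assumed: from non-inducing data

if __name__ == "__main__":
    print("inducing vs harmful:    ", mean_similarity(inducing, harmful))
    print("non-inducing vs harmful:", mean_similarity(non_inducing, harmful))
```

In this toy setup the inducing features have the higher average similarity to the harmful features, which is the pattern the summary describes. The vectors were chosen by hand to show that pattern and are not evidence for it.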
