RESEARCH

Understanding Emergent Misalignment via Feature Superposition Geometry

arXiv cs.AI · May 6, 2026

This paper proposes a geometric account, based on feature superposition, of emergent misalignment in LLMs: the phenomenon in which fine-tuning on narrow, non-harmful tasks can induce harmful behaviors. It shows that features tied to misalignment-inducing data lie geometrically closer to harmful features than features tied to non-inducing data do.
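One way to picture the "geometrically closer" comparison is as an average cosine similarity between sets of feature directions. The sketch below is only an illustration under assumptions: the summary does not say which distance measure the paper uses, cosine similarity is swapped in as a common choice, and the vectors and the names `harmful`, `inducing`, `non_inducing`, and `mean_similarity` are hypothetical toy values, not data or code from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_similarity(features, reference_features):
    """Average cosine similarity over every (feature, reference) pair."""
    sims = [cosine_similarity(f, r) for f in features for r in reference_features]
    return sum(sims) / len(sims)

# Toy 3-D feature directions, invented for illustration only.
# Assumption: real feature directions would come from a model's activations.
harmful = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
inducing = [[0.8, 0.2, 0.1], [0.7, 0.3, 0.0]]
non_inducing = [[0.0, 1.0, 0.0], [0.1, 0.0, 1.0]]

sim_inducing = mean_similarity(inducing, harmful)
sim_non_inducing = mean_similarity(non_inducing, harmful)

print(f"inducing vs harmful:     {sim_inducing:.3f}")
print(f"non-inducing vs harmful: {sim_non_inducing:.3f}")
```

In this toy setup the inducing directions score higher against the harmful set than the non-inducing ones do, which is the shape of the relationship the summary describes. It says nothing about the magnitudes the paper reports.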

feature superposition, LLMs, machine learning, misalignment, AI safety
