misalignment — AI articles, news & research

RESEARCH · arXiv CS.AI · 5/6/2026

Understanding Emergent Misalignment via Feature Superposition Geometry

This paper proposes a geometric account, based on feature superposition, of emergent misalignment in LLMs: the phenomenon in which fine-tuning on narrow, non-harmful tasks induces harmful behaviors. It shows that features tied to misalignment-inducing data lie geometrically closer to harmful features than features tied to non-inducing data do.
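
As a rough illustration of what "geometrically closer" could mean, the sketch below compares toy feature directions by cosine similarity. Everything in it is an assumption made for illustration: the vectors are invented, the function names are hypothetical, and cosine similarity is only one plausible closeness measure. The summary does not say which features or which metric the paper actually uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity_to(reference, features):
    """Average cosine similarity between a reference direction and a feature set."""
    return sum(cosine_similarity(reference, f) for f in features) / len(features)


# Toy 3-dimensional feature directions (invented, not taken from the paper).
harmful = [1.0, 0.0, 0.0]
inducing = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2]]      # hypothetical "inducing" features
non_inducing = [[0.0, 1.0, 0.0], [0.1, 0.0, 1.0]]  # hypothetical "non-inducing" features

inducing_score = mean_similarity_to(harmful, inducing)
non_inducing_score = mean_similarity_to(harmful, non_inducing)

print(f"inducing vs harmful:     {inducing_score:.3f}")
print(f"non-inducing vs harmful: {non_inducing_score:.3f}")
```

In this toy setup the inducing set scores higher against the harmful direction than the non-inducing set, which is the qualitative pattern the summary describes. A real analysis would use feature directions extracted from the model rather than hand-picked vectors.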

feature superposition · LLMs · machine learning · misalignment