One hidden neuron can disable safety guards
This study reports that safety layers in large language models can be disabled by flipping a single hidden neuron. The same minimal intervention works across multiple model families and scales, which challenges the assumption that alignment is robustly distributed across the network.
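
To make the idea of a single-neuron intervention concrete, here is a minimal toy sketch. It is not the study's method or code: the network, the hand-picked weights, the `refusal_score` function, and the reading of "flipping" as negating one hidden activation are all assumptions made purely for illustration. The sketch only shows that, when a behavior depends mostly on one hidden unit, negating that unit can reverse the output while negating another unit does not.

```python
import math

# Hypothetical hand-picked weights for a toy network with 2 inputs and
# 3 hidden units. These are illustrative assumptions, not values taken
# from any real model or from the study.
W_HIDDEN = [
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],  # hidden unit 2 plays the role of the toy "refusal" neuron
]
W_OUT = [0.1, 0.1, 3.0]  # the refusal score leans almost entirely on unit 2


def hidden_activations(x):
    """Compute the tanh activations of the hidden layer for input x."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_HIDDEN]


def refusal_score(x, flip_index=None):
    """Return the toy refusal score, optionally negating one hidden activation."""
    h = hidden_activations(x)
    if flip_index is not None:
        # Assumed meaning of "flipping": negate a single hidden activation.
        h[flip_index] = -h[flip_index]
    return sum(w * hi for w, hi in zip(W_OUT, h))


def refuses(x, flip_index=None):
    """The toy model 'refuses' when its refusal score is positive."""
    return refusal_score(x, flip_index) > 0.0


# Stand-in feature vector for a prompt the toy model should refuse.
harmful_prompt = [1.0, 1.0]

print(refuses(harmful_prompt))                # baseline behavior
print(refuses(harmful_prompt, flip_index=2))  # negate the "refusal" neuron
print(refuses(harmful_prompt, flip_index=0))  # negate an unrelated neuron
```

In this toy setup the refusal decision is concentrated in one unit by construction, so it says nothing about whether real models behave this way; that is the empirical claim the study makes.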