RESEARCH
Expert-Aware Refusal Steering
arXiv CS.CL · June 4, 2026
This paper extends refusal steering to Mixture-of-Experts (MoE) large language models and finds that the MoE architecture does not hinder steering performance. It proposes expert-aware refusal steering methods that leverage expert routing patterns, and it demonstrates that refusal behavior can be steered effectively based on a single expert's output.