RESEARCH
Towards Understanding the Robustness of Sparse Autoencoders
arXiv CS.LG · April 22, 2026
This research examines how Sparse Autoencoders (SAEs) affect the robustness of Large Language Models (LLMs) to jailbreak attacks. Integrating pretrained SAEs at inference time reduces jailbreak success rates by up to a factor of 5 and decreases cross-model attack transferability across various LLM families.