RESEARCH

Towards Understanding the Robustness of Sparse Autoencoders

arXiv CS.LG·April 22, 2026

This research examines how Sparse Autoencoders (SAEs) affect the robustness of Large Language Models (LLMs) to jailbreak attacks. Integrating pretrained SAEs at inference time lowers jailbreak success rates by up to 5x and reduces the transferability of attacks between models across several LLM families.

LLMs · security · machine learning
