RESEARCH
Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models
arXiv CS.AI · May 4, 2026
This paper seeks minimal, local, causal explanations for why jailbreak attacks succeed against large language models (LLMs). It addresses the lack of a robust understanding of why LLMs remain susceptible to these attacks, which elicit harmful responses despite safety training.