
Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

arXiv CS.AI · May 4, 2026

This paper investigates minimal, local, causal explanations for why jailbreak attacks succeed against large language models (LLMs). It addresses the lack of a robust understanding of why LLMs remain susceptible to these attacks, which elicit harmful responses despite safety training.
