RESEARCH
An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models
arXiv cs.CL · April 22, 2026
This empirical study investigates jailbreak detection in large language models and shows that evaluating a single output per prompt systematically underestimates vulnerability. Sampling more generations per prompt significantly improves the detection of harmful behavior, with the largest gains coming from the step up from a single sample to a moderate number of samples.
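One way to see why a single sample can understate vulnerability is a simple back-of-the-envelope model. The sketch below is an illustration, not the paper's method: it assumes each generation for a given prompt is harmful independently with a fixed probability `p_harmful` (a hypothetical parameter), and computes the chance that at least one of `k` samples is harmful. The function name and the example values are made up for illustration.

```python
def detection_probability(p_harmful: float, k: int) -> float:
    """Probability that at least one of k samples is harmful.

    Assumes (for illustration only) that samples are independent and
    each is harmful with the same probability p_harmful.
    """
    if not 0.0 <= p_harmful <= 1.0:
        raise ValueError("p_harmful must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    # Complement of "all k samples are benign".
    return 1.0 - (1.0 - p_harmful) ** k


if __name__ == "__main__":
    # Hypothetical per-sample harmful rate; not a figure from the study.
    p = 0.1
    for k in (1, 5, 10, 25):
        print(f"k={k:>2}  P(detect) = {detection_probability(p, k):.3f}")
```

Under this toy model the gain per extra sample shrinks as `k` grows, which is consistent with the summary's point that the move from one sample to a moderate number matters most. Real generations for the same prompt need not be independent, so this should be read only as intuition.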