RESEARCH 27

An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

arXiv cs.CL · April 22, 2026

This empirical study investigates jailbreak detection in large language models and shows that evaluating a single output per prompt systematically underestimates vulnerability. Sampling more generations per prompt significantly improves the detection of harmful behavior, with the largest gains coming from the step from one sample to a moderate number of samples.
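
The intuition behind this finding can be sketched with a simple probability model. The snippet below is an illustration only, not the paper's method: it assumes each generation for a given prompt is harmful independently with a fixed probability `p_harmful`, and the function name `detection_probability` and the value 0.1 are hypothetical choices made for the example.

```python
def detection_probability(p_harmful: float, k: int) -> float:
    """Probability that at least one of k sampled generations is harmful.

    Assumption (not from the paper): generations are independent and each
    is harmful with the same probability p_harmful.
    """
    if not 0.0 <= p_harmful <= 1.0:
        raise ValueError("p_harmful must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    # Complement of the event "all k generations are harmless".
    return 1.0 - (1.0 - p_harmful) ** k


if __name__ == "__main__":
    # Illustrative per-sample harm rate; not a measured value.
    p = 0.1
    for k in (1, 5, 10, 25):
        print(f"k={k:>2}  P(detect) = {detection_probability(p, k):.3f}")
```

Under this assumption, a prompt that elicits harmful output in one of ten generations is missed nine times out of ten by single-output evaluation, while the detection probability climbs steeply over the first few extra samples and then flattens, consistent with the summary's emphasis on moving from one sample to moderate sampling.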

LLMs · security · AI safety
