jailbreaking

3 items

RESEARCH · arXiv CS.CL · 4/30/2026

One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

This research introduces Incremental Completion Decomposition (ICD), a jailbreak strategy that exploits weaknesses in LLM safety mechanisms by eliciting a sequence of single-word continuations. The authors report a higher Attack Success Rate (ASR) than existing methods on several benchmarks, and offer theoretical and mechanistic evidence for why the attack works.

LLMs · jailbreaking · security · adversarial attacks

RESEARCH · DEV.to AI · 5/8/2026

Tiny weight edits improve LLM safety

The ASGuard method shows that small, targeted weight edits to specific attention heads can sharply reduce the success rate of jailbreaks that rely on linguistic tricks. This surgical approach patches the vulnerability by dampening activations in the relevant heads, preserving the model's general capabilities while improving safety.

AI models · jailbreaking · security · LLM safety

RESEARCH · DEV.to AI · 4/15/2026

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

This work introduces persona modulation, a technique for producing scalable, transferable black-box jailbreaks for language models. The method bypasses LLM safety mechanisms without requiring access to model internals.

language models · jailbreaking · PersonaModulation · Black-Box Attacks