
jailbreaking

3 items

RESEARCH · arXiv CS.CL · 4/30/2026

One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

This paper introduces Incremental Completion Decomposition (ICD), a jailbreak strategy that gets around LLM safety mechanisms by eliciting a sequence of single-word continuations. The authors report a higher Attack Success Rate (ASR) than existing methods across several benchmarks, and offer theoretical and mechanistic evidence for why the attack works.

29
RESEARCH · DEV.to AI · 5/8/2026

Tiny weight edits improve LLM safety

The ASGuard method makes small, targeted weight edits to specific attention heads in an LLM, sharply reducing the success rate of jailbreaks that rely on linguistic tricks. The edits dampen activations in the heads tied to the vulnerability, patching it while preserving the model's overall competence.

27
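
To make "dampening activations in relevant attention heads" concrete, here is a toy sketch of scaling the outputs of selected heads before they are merged. It is not the ASGuard implementation: the function names, the head index, and the 0.25 scale are all hypothetical, and a real method would have to identify the heads and learn the scales.

```python
from typing import Dict, List


def dampen_heads(
    head_outputs: List[List[float]],
    scales: Dict[int, float],
) -> List[List[float]]:
    """Return a copy of the per-head outputs with selected heads scaled.

    head_outputs: one output vector per attention head (toy values).
    scales: hypothetical map from head index to a damping factor;
            heads not listed keep a factor of 1.0.
    """
    damped = []
    for idx, vec in enumerate(head_outputs):
        factor = scales.get(idx, 1.0)
        damped.append([factor * x for x in vec])
    return damped


def concat_heads(head_outputs: List[List[float]]) -> List[float]:
    """Concatenate head outputs, as done before the output projection."""
    return [x for vec in head_outputs for x in vec]


# Three toy heads with two dimensions each (made-up numbers).
heads = [[1.0, 2.0], [4.0, -2.0], [0.5, 0.5]]

# Assume head 1 was flagged as relevant; shrink its contribution to 25%.
damped = dampen_heads(heads, {1: 0.25})
merged = concat_heads(damped)
print(merged)
```

Only the flagged head's contribution shrinks and every other head passes through unchanged, which is the intuition behind a surgical edit leaving general capability largely intact.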