← heapsort-ai

adversarial attacks

4 items

RESEARCHarXiv CS.CL·4/30/2026

One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

This research introduces Incremental Completion Decomposition (ICD), a novel jailbreak strategy that exploits weaknesses in LLM safety mechanisms by eliciting sequences of single-word continuations. ICD demonstrates superior Attack Success Rate (ASR) on various benchmarks compared to existing methods, providing theoretical and mechanistic evidence for its effectiveness.

29
RESEARCHarXiv CS.LG·21d ago

When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning

This research investigates adversarial action masking in self-play reinforcement learning, where an attacker selectively removes legal actions from a victim's action set. The study found that learned masking causes significantly more damage than random masking or perturbation baselines, highlighting action availability as a critical robustness surface in self-play RL.

27