adversarial attacks

4 items

RESEARCH · arXiv CS.AI · 1d ago

Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety

This paper investigates "attack selection" in agentic AI settings, where an attacker strategically chooses when to start and when to stop attacks. The findings show that this capability substantially lowers empirically measured safety in AI control evaluations, even with limited audit budgets.

security, AI control, Agentic AI, adversarial attacks
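Why attack selection can lower measured safety is easiest to see in a toy model. The sketch below is an illustrative assumption, not the paper's evaluation setup: each attack opportunity has a known probability of being audited (standing in for a limited audit budget), an unaudited attack succeeds with a fixed probability, and the run ends at the first audit or the first success. The names `red_win_probability`, `safety`, and `select_attacks` are hypothetical, and the attacker is assumed to know its catch probabilities exactly, which is a strong simplification. Only the "when to attack" part of attack selection is modeled.

```python
def red_win_probability(catch_probs, success_prob):
    """Probability that the attacker lands a successful, unaudited attack.

    Toy model (assumption): attacks occur in order. Attack i is audited with
    probability catch_probs[i] (blue team wins, run ends). If not audited, it
    succeeds with probability success_prob (red team wins, run ends);
    otherwise the run continues to the next attack.
    """
    p_undecided = 1.0  # probability the run is still undecided before this attack
    p_red = 0.0
    for c in catch_probs:
        p_red += p_undecided * (1.0 - c) * success_prob
        p_undecided *= (1.0 - c) * (1.0 - success_prob)
    return p_red


def safety(catch_probs, success_prob):
    # Safety = probability that no successful attack ever lands.
    return 1.0 - red_win_probability(catch_probs, success_prob)


def select_attacks(catch_probs, max_catch):
    # Attack selection (toy version): only attack on opportunities whose
    # estimated audit probability is below the attacker's chosen cutoff.
    return [c for c in catch_probs if c < max_catch]


if __name__ == "__main__":
    opportunities = [0.9, 0.2, 0.8, 0.1]  # hypothetical audit probabilities
    print("attack everywhere:", safety(opportunities, 0.5))
    print("with selection:  ", safety(select_attacks(opportunities, 0.5), 0.5))
```

With these made-up numbers, attacking at every opportunity gives safety of roughly 0.93, while skipping the high-risk opportunities drops it to roughly 0.42, because the selective attacker never exposes itself on the steps most likely to be audited.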

RESEARCH · arXiv CS.CL · 4/30/2026

One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

This research introduces Incremental Completion Decomposition (ICD), a jailbreak strategy that exploits weaknesses in LLM safety mechanisms by eliciting a sequence of single-word continuations. ICD achieves a higher Attack Success Rate (ASR) than existing methods on several benchmarks, and the authors provide theoretical and mechanistic evidence for its effectiveness.

LLMs, jailbreaking, security, adversarial attacks

RESEARCH · arXiv CS.LG · 21d ago

When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning

This research investigates adversarial action masking in self-play reinforcement learning, where an attacker selectively removes legal actions from a victim's action set. The study found that learned masking causes significantly more damage than random masking or perturbation baselines, highlighting action availability as a critical robustness surface in self-play RL.

reinforcement learning, security, self-play, adversarial attacks
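The gap between targeted and random action removal can be illustrated in a single decision step. The sketch below is a simplification and not the paper's method: the victim acts greedily on hypothetical Q-values, the attacker removes exactly `k` legal actions, and instead of a learned masking policy it uses an exhaustive (oracle) search over all masks. Random masking is evaluated exactly by averaging over every possible mask. The function names are hypothetical, and there is no self-play or opponent response here.

```python
from itertools import combinations
from statistics import mean


def victim_value(q_values, available):
    # The victim acts greedily: it picks the available action with the highest Q-value.
    return max(q_values[a] for a in available)


def remaining_action_sets(actions, k):
    # Yield the victim's remaining action set for every way of removing k actions.
    for removed in combinations(actions, k):
        yield [a for a in actions if a not in removed]


def adversarial_mask_value(q_values, k):
    # Oracle attacker: choose the mask that minimizes the victim's resulting value.
    actions = list(q_values)
    if not 0 <= k < len(actions):
        raise ValueError("k must leave at least one legal action")
    return min(victim_value(q_values, rem) for rem in remaining_action_sets(actions, k))


def random_mask_value(q_values, k):
    # Random masking baseline: expected victim value over all equally likely masks.
    actions = list(q_values)
    if not 0 <= k < len(actions):
        raise ValueError("k must leave at least one legal action")
    return mean(victim_value(q_values, rem) for rem in remaining_action_sets(actions, k))


if __name__ == "__main__":
    q = {"a": 1.0, "b": 0.6, "c": 0.3, "d": -0.2}  # hypothetical Q-values
    print("no masking:      ", victim_value(q, list(q)))
    print("random masking:  ", random_mask_value(q, 2))
    print("targeted masking:", adversarial_mask_value(q, 2))
```

With these values, removing two actions at random lowers the victim's value from 1.0 to about 0.75 on average, while the targeted mask removes both of the best actions and leaves only 0.3, which mirrors, in miniature, why a strategic attacker on the action set can do far more damage than a random one.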

RESEARCH · arXiv CS.LG · 6d ago

Making Brain-Computer Interfaces More Secure

This study proposes a lightweight custom Convolutional Neural Network (CNN) architecture for investigating adversarial robustness in EEG-based Brain-Computer Interfaces (BCIs). The method is evaluated on two EEG datasets and compared with other CNN models under gradient-based adversarial attacks, with the goal of supporting reliable BCI deployment.

neural networks, brain-computer interfaces, security, machine learning
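The summary does not say which gradient-based attacks were used, so the sketch below shows one common example, the Fast Gradient Sign Method (FGSM), as an assumption. To stay self-contained it attacks a tiny logistic-regression classifier on three hypothetical input features instead of a CNN on EEG signals; the weights, inputs, and `eps` value are made up. FGSM moves each input feature by `eps` in the direction of the sign of the loss gradient with respect to the input.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    # Probability of class 1 under a linear (logistic regression) model.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def input_gradient(w, b, x, y):
    # Gradient of the binary cross-entropy loss with respect to the input x:
    # dL/dx = (p - y) * w for logistic regression.
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    # FGSM: step each feature by eps in the sign direction of the loss gradient,
    # which increases the loss on the true label y.
    g = input_gradient(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]


if __name__ == "__main__":
    w, b = [2.0, -1.0, 0.5], 0.0  # hypothetical trained weights
    x, y = [0.3, 0.1, 0.2], 1     # hypothetical input with true label 1
    x_adv = fgsm(w, b, x, y, eps=0.2)
    print("clean p(y=1):      ", predict_proba(w, b, x))
    print("adversarial p(y=1):", predict_proba(w, b, x_adv))
```

Here the clean input is classified correctly (probability about 0.65), while the perturbed input falls below 0.5. For a linear model the FGSM step shifts the logit by exactly `eps` times the sum of absolute weights (0.2 × 3.5 = 0.7), which is why small, bounded perturbations can flip a decision; robustness studies like the one summarized here measure how much such perturbations degrade accuracy.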