Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety
This paper investigates "attack selection" in agentic AI settings, in which an attacker strategically chooses when to start and when to stop an attack. The findings show that this capability meaningfully lowers the safety measured in AI control evaluations, even with limited audit budgets.
