Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry
arXiv cs.LG · May 1, 2026
This research investigates the training-time mechanisms of refusal in safety-aligned language models, comparing supervised fine-tuning (SFT) with R2D2-style dynamic adversarial fine-tuning. The findings show that R2D2 initially achieves strong refusal on HarmBench but then partially reopens, while SFT remains consistently less robust.