
Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

arXiv CS.LG · May 1, 2026

This research investigates the training-time mechanisms of refusal in safety-aligned language models, comparing supervised fine-tuning (SFT) with R2D2-style dynamic adversarial fine-tuning. The findings show that R2D2 initially achieves strong refusal on HarmBench but then partially reopens, while SFT remains consistently less robust.
