Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model
arXiv cs.LG · April 14, 2026
This research investigates Deliberative Alignment in LLMs, a method designed to improve safety by distilling reasoning capabilities from stronger models. It identifies an alignment gap between teacher and student models: student models can retain unsafe behaviors from the base model even after learning the teacher's reasoning patterns. To address this, the paper proposes a Best-of-N (BoN) sampling method that is applied at inference time and attributes unsafe behavior to the base model.
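For readers unfamiliar with BoN (Best-of-N) sampling, the general idea can be sketched as follows: draw several candidate responses, score each one, and keep the highest-scoring candidate. This is a generic illustration only, not the paper's method. The names `best_of_n` and `toy_safety_score` are hypothetical, and the keyword-based scorer is a stand-in assumption for whatever scoring signal the authors actually use.

```python
from typing import Callable, Sequence, Tuple


def best_of_n(
    candidates: Sequence[str],
    safety_score: Callable[[str], float],
) -> Tuple[str, float]:
    """Return the highest-scoring candidate and its score.

    Ties are broken in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [(safety_score(c), -i, c) for i, c in enumerate(candidates)]
    best_score, _, best = max(scored, key=lambda t: (t[0], t[1]))
    return best, best_score


def toy_safety_score(response: str) -> float:
    # Hypothetical scorer (an assumption for illustration): each flagged
    # word lowers the score by one. A real system would use a learned
    # reward or safety model instead.
    flagged = ("exploit", "bypass")
    hits = sum(response.lower().count(word) for word in flagged)
    return float(-hits)


# In practice the N samples would come from the student model;
# here they are hard-coded so the sketch runs on its own.
samples = [
    "Here is how to bypass the filter and exploit it.",
    "I can't help with that, but here is a safe alternative.",
    "You could bypass it.",
]
best, score = best_of_n(samples, toy_safety_score)
print(best)
```

The selection step is independent of how candidates are generated, so the scorer can be swapped without touching `best_of_n`.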