RESEARCH27

Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model

arXiv CS.LG·April 14, 2026

This research investigates Deliberative Alignment in LLMs, a method designed to improve safety by distilling reasoning capabilities from stronger models. It uncovers an alignment gap between teacher and student models, showing that student models can retain unsafe behaviors from the base model despite learning advanced reasoning patterns. The paper proposes a BoN sampling method to address these challenges.

Model Alignment LLMs Deliberative Alignment Reasoning AI safety

Read original ↗