
Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation

arXiv cs.LG · May 18, 2026

This paper introduces on-policy self-distillation (OPSA) to reduce the "safety tax" in LLM safety alignment. OPSA addresses the distributional mismatch of off-policy training by having the model generate its own rollouts and receive dense per-token KL supervision from a frozen teacher.
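The training signal described above can be illustrated with a toy sketch. This is not the paper's implementation: the tiny "model" (`toy_logits`, a hypothetical table lookup keyed on the previous token), the vocabulary size, and the choice of KL direction (student relative to teacher) are all assumptions made for illustration. What the sketch does show is the on-policy part, where tokens are sampled from the student itself, and the dense part, where a KL term is computed at every position against a teacher that is never updated.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def toy_logits(weights, prefix, vocab_size):
    """Hypothetical stand-in for a language model: logits depend only on the last token."""
    last = prefix[-1] if prefix else 0
    return [weights[last][v] for v in range(vocab_size)]


def per_token_kl_on_rollout(student_w, teacher_w, vocab_size, length, rng):
    """Sample a rollout from the student and score every position against the teacher."""
    prefix, losses = [], []
    for _ in range(length):
        p_student = softmax(toy_logits(student_w, prefix, vocab_size))
        p_teacher = softmax(toy_logits(teacher_w, prefix, vocab_size))
        # Dense supervision: one KL term per generated position.
        # Assumption: KL(student || teacher); the paper may use another direction.
        losses.append(kl(p_student, p_teacher))
        # On-policy: the next token comes from the student, not from a fixed dataset.
        token = rng.choices(range(vocab_size), weights=p_student, k=1)[0]
        prefix.append(token)
    return prefix, losses


rng = random.Random(0)
V = 4  # assumed toy vocabulary size
teacher_w = [[rng.gauss(0, 1) for _ in range(V)] for _ in range(V)]  # frozen teacher
student_w = [[w + rng.gauss(0, 0.5) for w in row] for row in teacher_w]  # perturbed copy

rollout, losses = per_token_kl_on_rollout(student_w, teacher_w, V, 6, rng)
print("rollout:", rollout)
print("mean per-token KL:", sum(losses) / len(losses))
```

In a real setup the per-token KL values would be averaged into a loss and backpropagated through the student only. Because the prefixes are the student's own samples, the supervision lands on the states the student actually visits, which is the distributional mismatch that off-policy training leaves unaddressed.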

Tags: LLMs, machine learning, alignment, AI safety
