
Configurable Reward Model for Balanced Safety Alignment

arXiv cs.CL · June 1, 2026

This paper introduces the Configurable Safety Reward Model (CSRM), which addresses the challenge of aligning LLMs with heterogeneous and rapidly evolving safety requirements. CSRM is jointly optimized for calibrated safety compliance and reward modeling. According to the paper, this substantially improves generalization to previously unseen safety configurations and achieves state-of-the-art benchmark performance.

Generalization · Machine learning · Large language models · Reward models · AI safety
