Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs
arXiv CS.AI · May 23, 2026
This research introduces MOOD, a benchmark for studying how well monitoring pipelines detect out-of-distribution (OOD) alignment failures in large language models (LLMs). Because safety classifiers often fail to generalize to OOD scenarios, it proposes combining guard models with OOD detectors to improve their generalization.