
Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

arXiv CS.AI · May 23, 2026

This research introduces MOOD, a benchmark for studying how well monitoring pipelines detect out-of-distribution (OOD) alignment failures in large language models (LLMs). Because safety classifiers often fail in OOD scenarios, it proposes combining guard models with OOD detectors to improve their generalization.
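
The summary does not say how the guard model and the OOD detector are combined, so the following is only a minimal sketch of one plausible arrangement, not the paper's method. It assumes the guard model yields a probability that an input is unsafe and that inputs have numeric embeddings. The function names (`fit_reference`, `ood_score`, `monitor`), the z-score distance, and both thresholds are hypothetical choices made for illustration.

```python
import math

def fit_reference(embeddings):
    """Per-dimension mean and standard deviation of in-distribution embeddings."""
    n = len(embeddings)
    dims = len(embeddings[0])
    means = [sum(e[d] for e in embeddings) / n for d in range(dims)]
    stds = [
        # Fall back to 1.0 when a dimension has zero variance, to avoid dividing by zero.
        math.sqrt(sum((e[d] - means[d]) ** 2 for e in embeddings) / n) or 1.0
        for d in range(dims)
    ]
    return means, stds

def ood_score(embedding, means, stds):
    """Root-mean-square z-score: larger values mean farther from the reference data."""
    total = sum(((x - m) / s) ** 2 for x, m, s in zip(embedding, means, stds))
    return math.sqrt(total / len(means))

def monitor(guard_prob, embedding, means, stds,
            guard_threshold=0.5, ood_threshold=3.0):
    """Flag when the guard is confident OR the input looks out of distribution.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if guard_prob >= guard_threshold:
        return "flag: guard"
    if ood_score(embedding, means, stds) >= ood_threshold:
        return "flag: ood"
    return "pass"

# Toy in-distribution embeddings (made up for this example).
reference = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
means, stds = fit_reference(reference)
```

With an OR rule like this, an input the guard model scores as safe can still be escalated when it sits far from the data the guard was trained on, which is the case where the guard's score is least trustworthy.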
