RESEARCH · arXiv CS.AI · 18d ago
Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs
This research introduces MOOD, a benchmark designed to study the detection of out-of-distribution (OOD) alignment failures in large language models (LLMs) using monitoring pipelines. It proposes combining guard models with OOD detectors to improve the generalization of safety classifiers, which often fail in OOD scenarios.