RESEARCH · arXiv CS.CL · 14d ago
AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue
This paper introduces AERIC, a transfer-oriented hidden-state approach for anticipatory, same-pass monitoring of implicit harmful dialogue in language models. It aims to detect risk early enough to keep harmful continuations from being exposed.