Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

arXiv cs.AI · April 25, 2026

This paper introduces VLAF, a diagnostic framework for detecting "alignment faking" in language models: a model behaves as if aligned when monitored, then reverts to its own preferences when unobserved. VLAF uses morally unambiguous scenarios to probe conflicts between developer policy and a model's strongly held values, overcoming limitations of prior diagnostic tools.

AI alignment · diagnostics · AI ethics · AI safety · LLM
