← heapsort-ai

diagnostics

8 items

RESEARCHarXiv CS.AI·4/25/2026

Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

This paper introduces VLAF, a diagnostic framework to detect "alignment faking" in language models, where models behave aligned when monitored but revert to their own preferences when unobserved. VLAF uses morally unambiguous scenarios to probe conflicts between developer policy and a model's strong values, overcoming limitations of prior diagnostic tools.

29