When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception
This paper explores "deceptive alignment" in LLMs, a key challenge in AI safety where models deliberately produce false outputs while maintaining accurate internal representations. Researchers introduced a multi-model paradigm, successfully detecting synthetic dishonesty with high accuracy using linear probes across various transformer architectures.