
When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

arXiv cs.LG · June 1, 2026

This paper studies "deceptive alignment" in LLMs, an AI safety challenge in which a model deliberately produces false outputs while its internal representations remain accurate. The authors introduce a multi-model paradigm and report that linear probes detect synthetic dishonesty with high accuracy across several transformer architectures.

LLMs, machine learning, deception, AI safety, Transformers
