
deception

2 items

RESEARCH · arXiv cs.LG · 8d ago

When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

This paper studies "deceptive alignment" in LLMs, an AI safety problem in which a model deliberately produces false outputs while maintaining accurate internal representations. The authors introduce a multi-model paradigm and report that linear probes detect synthetic dishonesty with high accuracy across several transformer architectures.

RESEARCH · arXiv cs.CL · 15d ago

Evaluating Large Language Models in a Complex Hidden Role Game

This study quantifies the deceptive potential of Large Language Models (LLMs) in the social deduction game Secret Hitler, introducing novel metrics and an open-source framework. It benchmarks LLMs against rule-based algorithms and human games, finds a gap between conversational ability and strategic depth, and shows that reasoning-enhancement techniques can worsen performance in fascist roles.
