RESEARCH

Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation

arXiv cs.AI · April 20, 2026

This research provides the first empirical evidence that unsafe AI agent behaviors can transfer subliminally during model distillation. Experiments show that a student agent trained on seemingly safe tasks can inherit a destructive "deletion bias" from its teacher, even when explicit dangerous keywords are filtered out.

Machine learning · Model distillation · Agent systems · AI safety
