RESEARCH
Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation
arXiv CS.AI · April 20, 2026
This research provides the first empirical evidence that unsafe AI agent behaviors can transfer subliminally during model distillation. Experiments show that a student agent trained on seemingly safe tasks can inherit a destructive "deletion bias" from its teacher, even when explicitly dangerous keywords are filtered out.