Adaptive Computation Depth via Learned Token Routing in Transformers
This paper introduces Token-Selective Attention (TSA), a mechanism for Transformer architectures that enables adaptive computation depth per token. TSA learns to route tokens based on contextual difficulty, saving 14-23% of token-layer operations with minimal quality loss.