The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K–V Asymmetry
This systematic study of singular value spectra during transformer pretraining reveals three key phenomena: transient compression waves that propagate through the layers, persistent spectral gradients, and a Q/K–V functional asymmetry, in which query/key projections drive depth-dependent dynamics while value/output projections compress uniformly.