Do Transformers Need Three Projections? Systematic Study of QKV Variants
This study systematically evaluates variants of the Query, Key, and Value (QKV) attention formulation in Transformers: a shared key-value projection, a shared query-key projection, and a single projection used for all three roles. Experiments across synthetic, vision, and language modeling tasks show that these alternative formulations perform on par with, and occasionally better than, standard QKV Transformers. In language modeling, the Q-K=V variant (a separate query projection with a shared key-value projection) additionally offers a significant reduction in KV cache size.
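
To make the variants concrete, the following is a minimal single-head NumPy sketch of how the projection matrices might be tied. It is an illustration under stated assumptions, not the authors' implementation: the variant labels (`qkv`, `q_kv`, `qk_v`, `single`), the helper names, the random initialization, and the omission of multi-head structure and causal masking are all choices made here for brevity.

```python
import numpy as np

# Hypothetical labels for the variants named in the abstract.
VARIANTS = ("qkv", "q_kv", "qk_v", "single")


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def make_weights(variant, d_model, d_head, rng):
    """Return (w_q, w_k, w_v); tied variants return the same array twice or thrice."""
    def init():
        return rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)

    if variant == "qkv":      # standard: three independent projections
        return init(), init(), init()
    if variant == "q_kv":     # Q-K=V: keys and values share one projection
        w_q, w_kv = init(), init()
        return w_q, w_kv, w_kv
    if variant == "qk_v":     # Q=K-V: queries and keys share one projection
        w_qk, w_v = init(), init()
        return w_qk, w_qk, w_v
    if variant == "single":   # Q=K=V: one projection for all three roles
        w = init()
        return w, w, w
    raise ValueError(f"unknown variant: {variant}")


def attention_scores(x, w_q, w_k):
    """Scaled dot-product logits, before softmax and without masking."""
    q, k = x @ w_q, x @ w_k
    return q @ k.T / np.sqrt(q.shape[-1])


def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over a (seq_len, d_model) input."""
    return softmax(attention_scores(x, w_q, w_k)) @ (x @ w_v)


def cached_tensors_per_token(variant):
    """Distinct per-token tensors a decoder would cache (K and V, or one shared tensor)."""
    return 1 if variant in ("q_kv", "single") else 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 16))  # 4 tokens, d_model = 16
    for variant in VARIANTS:
        out = attention(x, *make_weights(variant, 16, 8, rng))
        print(variant, out.shape, cached_tensors_per_token(variant))
```

Two properties follow directly from the tying in this sketch. When keys and values share a projection, only one tensor per token needs to be cached during autoregressive decoding instead of two, which is where the cache saving comes from. When queries and keys share a projection, the unmasked score matrix is symmetric, since it reduces to a product of the form Q Qᵀ.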