GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding
arXiv cs.LG · May 18, 2026
This paper introduces Group-Query Latent Attention (GQLA), a modification of Multi-head Latent Attention (MLA). GQLA exposes two algebraically equivalent decoding paths, so a single set of trained weights can be served efficiently on different hardware platforms, such as H100 and H20, without retraining.
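This summary does not say what GQLA's two decoding paths are. The sketch below only illustrates the general idea of two algebraically equivalent ways to compute the same attention scores from one set of weights. It uses the key up-projection "absorption" identity commonly described for latent-attention designs, and it is not GQLA's actual formulation. All names, shapes, and the single-head setup are hypothetical.

```python
import numpy as np

# Hypothetical toy dimensions; none of these come from the paper.
rng = np.random.default_rng(0)
d_model, d_latent, d_head, seq_len = 16, 4, 8, 5

W_q = rng.standard_normal((d_model, d_head))      # query projection (assumed)
W_dkv = rng.standard_normal((d_model, d_latent))  # down-projection to the latent cache (assumed)
W_uk = rng.standard_normal((d_latent, d_head))    # up-projection from latent to keys (assumed)

h_ctx = rng.standard_normal((seq_len, d_model))   # hidden states of cached tokens
h_new = rng.standard_normal(d_model)              # hidden state of the token being decoded

c_kv = h_ctx @ W_dkv   # (seq_len, d_latent): the compressed cache
q = h_new @ W_q        # (d_head,): query for the current token


def scores_expanded(q, c_kv, W_uk):
    """Path A: expand the latent cache to full keys, then dot with the query."""
    k = c_kv @ W_uk    # (seq_len, d_head)
    return k @ q       # (seq_len,)


def scores_absorbed(q, c_kv, W_uk):
    """Path B: fold the up-projection into the query and score in latent space."""
    q_latent = W_uk @ q        # (d_latent,)
    return c_kv @ q_latent     # (seq_len,)


scores_a = scores_expanded(q, c_kv, W_uk)
scores_b = scores_absorbed(q, c_kv, W_uk)

# Both paths compute (c_kv W_uk) q = c_kv (W_uk q), so they agree up to rounding.
paths_agree = np.allclose(scores_a, scores_b)
```

The two functions differ only in where the matrix products are grouped, which changes the balance between arithmetic and memory traffic. That trade-off is the kind of thing a hardware-dependent path choice could exploit, though which path suits which GPU is not something this sketch establishes.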