GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

arXiv cs.LG · May 18, 2026

This paper introduces Group-Query Latent Attention (GQLA), a modification of Multi-head Latent Attention (MLA). GQLA exposes two algebraically equivalent decoding paths, so a single set of trained weights can run efficiently on different hardware platforms, such as NVIDIA H100 and H20 GPUs, without retraining.
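The summary does not spell out GQLA's two paths, so the following is only an illustrative sketch of how one set of weights can admit two equivalent decoding computations. It uses the well-known weight-absorption identity from MLA-style attention, not the paper's actual formulation: scoring a query against up-projected keys gives the same result as folding the up-projection into the query and scoring against the compressed latents. All names (`w_uk`, `latents`, `scores_expanded`, `scores_absorbed`) and the toy sizes are assumptions made for this example.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def scores_expanded(q, w_uk, latents):
    """Path A: up-project each cached latent to a full key, then score."""
    return [dot(q, matvec(w_uk, c)) for c in latents]


def scores_absorbed(q, w_uk, latents):
    """Path B: fold the up-projection into the query once, then score
    directly against the compressed latents."""
    q_latent = matvec(transpose(w_uk), q)
    return [dot(q_latent, c) for c in latents]


# Hypothetical toy sizes, chosen only for illustration.
rng = random.Random(0)
d_head, d_latent, n_tokens = 8, 4, 5

w_uk = [[rng.gauss(0, 1) for _ in range(d_latent)] for _ in range(d_head)]
latents = [[rng.gauss(0, 1) for _ in range(d_latent)] for _ in range(n_tokens)]
q = [rng.gauss(0, 1) for _ in range(d_head)]

a = scores_expanded(q, w_uk, latents)
b = scores_absorbed(q, w_uk, latents)

# The two paths agree up to floating-point rounding.
max_diff = max(abs(x - y) for x, y in zip(a, b))
print(max_diff < 1e-9)
```

Both functions use the same `w_uk`, so switching paths needs no retraining. The two paths differ in how much arithmetic they do and how much cached data they read, so which one is faster on a given accelerator would depend on its balance of compute throughput and memory bandwidth.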

Tags: Deep Learning, Attention Mechanism, AI Efficiency, Hardware Optimization, LLM