RESEARCH27

Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention

arXiv CS.LG·April 24, 2026

This paper introduces Gist Sparse Attention (GSA), an end-to-end learnable method to scale large language models to long contexts without architectural modifications. GSA compresses context into 'gist tokens' for summary, then selectively restores relevant raw chunks for detailed attention, combining compact global representations with targeted fine-grained access.

neural networks model efficiency attention mechanisms large language models

Read original ↗