RESEARCH

ART: Attention Run-time Termination for Efficient Large Language Model Decoding

arXiv cs.CL · June 2, 2026

Long-context decoding in Large Language Models (LLMs) is severely constrained by the memory bandwidth needed to read the Key-Value (KV) cache. This paper proposes Attention Run-time Termination (ART), a lightweight mechanism that optimizes KV cache access and yields 20% higher generation throughput.

LLMs · memory management · decoding performance · AI Research
