
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

arXiv cs.LG · May 8, 2026

This paper introduces sparse prefix caching, an optimization for LLM serving that stores recurrent states at selected checkpoint positions rather than requiring the entire token history. The method consistently improves the Pareto frontier over standard heuristics, especially for workloads in which requests share a non-trivial prefix.
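
To make the idea concrete, here is a minimal sketch of checkpoint-based prefix reuse for a recurrent model. It is an illustration under stated assumptions, not the paper's algorithm: the fixed `stride` checkpoint placement, the `SparsePrefixCache` class, the hash-based `prefix_key`, and the integer `toy_step` recurrence are all hypothetical stand-ins. The summary does not say how checkpoint positions are actually chosen.

```python
from hashlib import sha256
from typing import Callable, Dict, List, Optional, Tuple

State = int  # stand-in for a real recurrent state (e.g. an SSM hidden state)


def prefix_key(tokens: List[int]) -> str:
    """Hash a token prefix so it can be used as a cache key."""
    return sha256(",".join(map(str, tokens)).encode()).hexdigest()


def toy_step(state: State, token: int) -> State:
    """Hypothetical recurrence: the new state depends only on (state, token)."""
    return (state * 31 + token) % 1_000_003


class SparsePrefixCache:
    """Stores states only at every `stride`-th position (an assumed policy)."""

    def __init__(self, stride: int) -> None:
        if stride <= 0:
            raise ValueError("stride must be positive")
        self.stride = stride
        self.store: Dict[str, State] = {}

    def lookup(self, tokens: List[int]) -> Tuple[int, Optional[State]]:
        """Return (position, state) for the longest cached checkpoint prefix."""
        pos = (len(tokens) // self.stride) * self.stride
        while pos > 0:
            state = self.store.get(prefix_key(tokens[:pos]))
            if state is not None:
                return pos, state
            pos -= self.stride
        return 0, None

    def run(
        self,
        tokens: List[int],
        step: Callable[[State, int], State],
        init: State,
    ) -> Tuple[State, int]:
        """Process `tokens`, resuming from the nearest cached checkpoint.

        Returns the final state and the number of tokens actually computed.
        """
        pos, cached = self.lookup(tokens)
        state = init if cached is None else cached
        computed = 0
        for i in range(pos, len(tokens)):
            state = step(state, tokens[i])
            computed += 1
            # Save a checkpoint whenever a stride boundary is crossed.
            if (i + 1) % self.stride == 0:
                self.store[prefix_key(tokens[: i + 1])] = state
        return state, computed
```

In this sketch with `stride=4`, running a 10-token request and then an 11-token request that shares its first 9 tokens resumes the second request from the checkpoint at position 8, so only 3 tokens are recomputed. A larger stride stores fewer states but recomputes more tokens on a hit, which is the kind of memory-versus-compute trade-off that a Pareto comparison would capture.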

LLMs AI infrastructure Caching performance State Space Models
