Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

arXiv cs.LG · April 28, 2026

This work addresses the large memory footprint of Key-Value (KV) caching in transformer language models by optimizing along the depth dimension. It introduces a method for cross-layer cache sharing, shows that a layer's cache can be dropped efficiently without information loss, and proposes a training approach based on random cross-layer attention.
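
To make the idea of depth-wise sharing concrete, here is a minimal sketch of how a random cross-layer routing could be sampled. The summary does not specify the paper's actual routing rule, so everything here is an assumption for illustration: the function names, the `share_prob` parameter, and the rule that a layer either keeps its own cache or reuses the one read by the layer directly below it.

```python
import random


def sample_kv_routes(num_layers, share_prob, rng):
    """Pick, for each layer, the index of the layer whose KV cache it reads.

    Assumed rule (not taken from the paper): layer 0 always keeps its own
    cache; each later layer reuses the cache read by the layer below it
    with probability share_prob, and otherwise keeps its own.
    """
    routes = [0]
    for layer in range(1, num_layers):
        if rng.random() < share_prob:
            routes.append(routes[-1])  # share: drop this layer's own cache
        else:
            routes.append(layer)       # keep: this layer stores its own K/V
    return routes


def kv_memory_fraction(routes):
    """Fraction of per-layer caches that must actually be stored."""
    return len(set(routes)) / len(routes)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so a run is reproducible
    routes = sample_kv_routes(num_layers=8, share_prob=0.5, rng=rng)
    print("routes:", routes)
    print("stored cache fraction:", kv_memory_fraction(routes))
```

Under this assumed rule, resampling the routes at every training step would expose each layer to attending over another layer's keys and values, and the stored cache fraction gives a rough sense of the memory saved at a given sharing probability.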

Deep Learning · Memory Optimization · Large Language Models · Transformers
