RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory
arXiv cs.LG · May 11, 2026
This paper introduces RateQuant, a method for optimal mixed-precision KV cache quantization in large language models, aimed at memory bottlenecks. It addresses distortion model mismatch: applying one quantizer's distortion model to a different quantizer yields worse performance than uniform quantization.