
RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory

arXiv cs.LG · May 11, 2026

This paper introduces RateQuant, a method for optimal mixed-precision quantization of the KV cache in large language models, intended to address memory bottlenecks. It tackles distortion model mismatch: when one quantizer's distortion model is applied to a different quantizer, the resulting mixed-precision scheme can perform worse than uniform quantization.
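To make the idea of rate-distortion bit allocation concrete, here is a minimal sketch of how a fixed bit budget might be split across groups of cache values. It is not the paper's algorithm: the function names `allocate_bits` and `total_distortion` are hypothetical, and the distortion model D(b) = variance * 2^(-2b) is the standard textbook high-resolution approximation, assumed here purely for illustration.

```python
import heapq


def allocate_bits(variances, total_bits, min_bits=1, max_bits=8):
    """Greedy bit allocation under an ASSUMED model D_i(b) = var_i * 2**(-2*b).

    Each step gives one more bit to the group whose distortion drops the most.
    Because the assumed model is convex in b, this greedy rule minimizes the
    summed distortion for the given integer bit budget.
    """
    n = len(variances)
    if not n * min_bits <= total_bits <= n * max_bits:
        raise ValueError("bit budget is infeasible for the given bit range")
    bits = [min_bits] * n

    def gain(i):
        # Distortion removed by raising group i from b to b + 1 bits.
        b = bits[i]
        return variances[i] * (2.0 ** (-2 * b) - 2.0 ** (-2 * (b + 1)))

    # Max-heap on gain, implemented by negating the key.
    heap = [(-gain(i), i) for i in range(n)] if min_bits < max_bits else []
    heapq.heapify(heap)
    remaining = total_bits - n * min_bits
    while remaining > 0 and heap:
        _, i = heapq.heappop(heap)
        bits[i] += 1
        remaining -= 1
        if bits[i] < max_bits:
            heapq.heappush(heap, (-gain(i), i))
    return bits


def total_distortion(variances, bits):
    """Summed distortion under the same assumed model."""
    return sum(v * 2.0 ** (-2 * b) for v, b in zip(variances, bits))


if __name__ == "__main__":
    variances = [16.0, 1.0, 1.0, 0.25]  # made-up per-group variances
    mixed = allocate_bits(variances, total_bits=16)  # 4 bits per group on average
    uniform = [4] * len(variances)
    print("mixed bits:", mixed)
    print("mixed distortion:  ", total_distortion(variances, mixed))
    print("uniform distortion:", total_distortion(variances, uniform))
```

Under the assumed model, the mixed allocation is never worse than uniform at the same budget. That guarantee holds only if the quantizer actually in use follows the model. If its real distortion curve differs, the allocation can be poor, which is the mismatch problem the summary describes.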

Memory Optimization quantization AI Research LLM
