[llama.cpp] Asymmetric KV q8/q4 cache: current caveats and discussion in GGML repo
In llama.cpp, asymmetric KV cache quantization (q8/q4) can cause processing to fall back to the CPU on CUDA builds. A GitHub discussion describes a fix: compile llama.cpp with support for that specific KV cache quant combination. The asymmetric setup offers substantial memory savings with a reported precision loss of only 1.3%.
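To put a number on "substantial memory savings", here is a rough sketch of the arithmetic. It assumes the q8/q4 pair means `q8_0` for keys and `q4_0` for values, and it uses the ggml block layouts (32 values per block with one fp16 scale, so 34 bytes for `q8_0` and 18 bytes for `q4_0`). The model dimensions are hypothetical, picked only for illustration, and the helper names are not part of llama.cpp.

```python
# Bytes needed to store 32 cached values in each format.
# q8_0: 32 x int8 + one fp16 scale = 34 bytes
# q4_0: 32 x 4-bit (16 bytes) + one fp16 scale = 18 bytes
# f16 is not block-quantized; 64 is simply 32 values x 2 bytes.
BYTES_PER_32_VALUES = {"f16": 64, "q8_0": 34, "q4_0": 18}
BLOCK_SIZE = 32


def tensor_bytes(n_values: int, cache_type: str) -> int:
    """Size in bytes of n_values stored in the given cache type."""
    if n_values % BLOCK_SIZE != 0:
        raise ValueError("value count must be a multiple of the block size")
    return n_values // BLOCK_SIZE * BYTES_PER_32_VALUES[cache_type]


def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, type_k, type_v):
    """Total K + V cache size; K and V each hold the same number of values."""
    n_values = n_layers * n_ctx * n_kv_heads * head_dim
    return tensor_bytes(n_values, type_k) + tensor_bytes(n_values, type_v)


# Hypothetical model shape (an assumption, not taken from the discussion).
SHAPE = dict(n_layers=32, n_ctx=32768, n_kv_heads=8, head_dim=128)

baseline = kv_cache_bytes(**SHAPE, type_k="f16", type_v="f16")
asymmetric = kv_cache_bytes(**SHAPE, type_k="q8_0", type_v="q4_0")
saving = 1 - asymmetric / baseline

GIB = 1024 ** 3
print(f"f16/f16:   {baseline / GIB:.3f} GiB")
print(f"q8_0/q4_0: {asymmetric / GIB:.3f} GiB")
print(f"saving:    {saving:.1%}")
```

Under these assumed dimensions the cache shrinks from 4 GiB to 1.625 GiB, a reduction of about 59%. The estimate covers only the K and V tensors and ignores any other buffers the runtime allocates.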