
quantization

57 items

ARTICLE · DEV.to AI · 29d ago

When I started running models locally, I thought quantization meant squeezing more into RAM. Turns o

The article advises against defaulting to Q4_K_M for local LLM inference, emphasizing that optimal performance comes from testing quantization levels tailored to specific workflows. It suggests that aggressive quantization like Q3_K_S can significantly cut latency with imperceptible quality loss for many tasks, though context length presents a trade-off.

27
RESEARCH · arXiv CS.LG · 4/24/2026

FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels

FairyFuse is a new inference system designed for CPU-only platforms, enabling multiplication-free execution of large language models. It uses ternary weights ({-1, 0, +1}) to replace floating-point multiplications with conditional additions and subtractions, significantly reducing memory bandwidth bottlenecks and offering up to 16x weight compression.
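
To illustrate the arithmetic idea only (this is not FairyFuse's fused kernel, and the function name `ternary_matvec` is a hypothetical one chosen for this sketch), a matrix-vector product with weights restricted to {-1, 0, +1} can be written with additions and subtractions alone:

```python
def ternary_matvec(weights, x):
    """Compute y = W @ x where every entry of W is -1, 0, or +1.

    No multiplication is performed: each weight only decides whether the
    matching input is added, subtracted, or skipped.
    """
    y = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1: add the input
            elif w == -1:
                acc -= xi      # -1: subtract the input
            # 0: the input contributes nothing
        y.append(acc)
    return y


# Illustrative values, not taken from the paper.
W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, -1.5]
print(ternary_matvec(W, x))  # [2.0, 0.0]
```

A real CPU kernel would pack the weights into a few bits each and vectorize the loop; the sketch only shows why the multiplications disappear.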

27
RESEARCH · arXiv CS.LG · 28d ago

Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant

This research analyzes three KV cache quantization schemes (KV, KQV, QKQV) and their impact on inner product variance, especially how QJL on K inflates it, amplified by softmax. Empirical findings highlight KQV's superior performance at a budget of n=4, an unconditional K-V asymmetry where QKQV is consistently worse than KQV in KL divergence, and budget-dependent crossovers for geometric K reconstruction.

27
RESEARCH · arXiv CS.LG · 5/7/2026

MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning

This research introduces MP-ISMoE, a Mixed-Precision Interactive Side Mixture-of-Experts framework, to enhance parameter-efficient transfer learning by mitigating memory overhead. It employs a Gaussian Noise Perturbed Iterative Quantization (GNP-IQ) scheme for lower-bit weight quantization, freeing up memory to improve side network learning capacity and performance.

27
RESEARCH · arXiv CS.LG · 20d ago

Theory-optimal Quantization Based on Flatness

This research models the relationship between quantization error and outliers in Large Language Models (LLMs) and introduces a new metric, Flatness, to quantify outlier distribution. From this model it derives a theoretically optimal solution and proposes Bidirectional Diagonal Quantization (BDQ) for post-training quantization.

27
RESEARCH · arXiv CS.LG · 27d ago

QuIDE: Mastering the Quantized Intelligence Trade-off via Active Optimization

QuIDE introduces a unified metric, the Intelligence Index I, to evaluate the efficiency of quantized neural networks by collapsing the compression-accuracy-latency trade-off. Experiments across various settings identify task-dependent optimal quantization (4-bit or 8-bit), providing a reproducible evaluation protocol and a fitness function for mixed-precision search.

27
RESEARCH · arXiv CS.LG · 22d ago

Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels

This study investigates how post-training quantization affects the quality of Large Language Models (LLMs) and finds that compression can cause bias to emerge. At 3 bits, quantization caused 6-21% of previously unbiased items to develop new stereotypical behaviors in models like Qwen2.5-7B, Mistral-7B, and Phi-3.5-mini, following a clear dose-response pattern across precision levels.

27
DOC · DEV.to AI · 14d ago

How to Deploy Llama 3.2 90B with vLLM + Quantization on a $20/Month DigitalOcean GPU Droplet: Enterprise Reasoning at 1/140th Claude Opus Cost

This guide covers deploying the Llama 3.2 90B model with vLLM and quantization on a DigitalOcean GPU droplet that it prices at $20/month. It claims enterprise-grade reasoning at 1/140th the cost of Claude Opus.

27