quantization

57 items

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

Qwen 3.6 35B A3B Q4_K_M quant evaluation

The post evaluates the Qwen 3.6 35B A3B Q4_K_M quantized MoE model on CPU using HumanEval, HellaSwag, and BFCL. The model ran at 22 tokens/sec and scored 74% on commonsense reasoning, which the post presents as a solid result for an MoE model with 3B active parameters.

AI model evaluation Benchmarking quantization MoE

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 5/6/2026

Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...)

The post compares the quality of several Qwen 3.6 27B quantizations with a custom chess game test, aiming to find the best fit for 16 GB VRAM setups. The test checks how well each quant tracks board state and whether it generates accurate SVG images of the chessboard.

VRAM Benchmarking quantization model quality

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 28d ago

I got a real transformer language model running locally on a stock Game Boy Color!

A transformer language model (TinyStories-260K) was run locally on a stock Game Boy Color using INT8 weights and fixed-point math. The project relies on a custom ROM and on-device tokenization; inference is extremely slow and the output is gibberish.

Hardware Acceleration Edge AI quantization AI inference
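
The summary above mentions INT8 weights and integer-style math. As a rough illustration of the general idea (not the author's ROM code, which is not shown here), the following sketch applies symmetric per-tensor INT8 quantization and computes a dot product with integer multiply-accumulates. The function names and sample values are hypothetical, and the final rescale uses floats for brevity where a real fixed-point implementation would use an integer multiplier and shift.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # Assumption: one scale per tensor; guard against an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def int_dot(a_codes, b_codes):
    """Integer-only multiply-accumulate, as an integer-only CPU would do it."""
    return sum(a * b for a, b in zip(a_codes, b_codes))


# Hypothetical sample data, chosen only for illustration.
weights = [0.5, -1.0, 0.25, 0.75]
activations = [1.0, 0.5, -0.5, 2.0]

w_q, w_scale = quantize_int8(weights)
a_q, a_scale = quantize_int8(activations)

# Rescale the integer accumulator back to real units.
approx = int_dot(w_q, a_q) * w_scale * a_scale
exact = sum(w * a for w, a in zip(weights, activations))
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The integer result differs slightly from the float result because each value is rounded to the nearest of 255 levels; that rounding error is the cost of storing one byte per weight.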

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/24/2026

Takeaways & discussion about the DeepSeek V4 architecture

This article discusses the architectural novelties of DeepSeek V4, highlighting its hybrid attention system (CSA + HCA) and Manifold-Constrained Hyper-Connections. It also touches on frontier-scale FP4 QAT training, differentiating it from previous models.

DeepSeek deep learning attention mechanisms quantization

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/17/2026

Qwen3.6 GGUF Benchmarks

The post presents KL-divergence (KLD) benchmarks for Unsloth's Qwen3.6-35B-A3B GGUF quants, comparing KLD against disk space to show how efficient each quant is. It also clarifies that frequent GGUF updates usually stem from external bug fixes or official improvements rather than from errors on Unsloth's side.

LLMs quantization Benchmarks
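
For readers unfamiliar with the KLD metric in the item above: it measures how far a quantized model's next-token distribution drifts from the reference model's. A minimal sketch, assuming you already have per-token logits from both models (the function names and sample logits below are hypothetical, and this is not Unsloth's benchmarking code):

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) for a single token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kld(ref_batch, quant_batch):
    """Average KLD over all token positions in an evaluation text."""
    pairs = list(zip(ref_batch, quant_batch))
    return sum(kl_divergence(r, q) for r, q in pairs) / len(pairs)


# Hypothetical logits over a 3-token vocabulary.
ref_logits = [2.0, 1.0, 0.1]
quant_logits = [1.9, 1.1, 0.0]
print(mean_kld([ref_logits], [quant_logits]))
```

A value of zero means the two distributions are identical; larger values indicate the quant diverges more from the reference, so plotting mean KLD against file size gives one view of quality per byte.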

CASE · ↑ trending · Reddit r/MachineLearning · 4/27/2026

INT8 quantization gives me better accuracy than FP16 ! [D]

A user observed that INT8 quantization in their deep learning model yielded better inference accuracy than FP16, which was unexpected. They are seeking explanations for INT8's superior performance over FP16.

inference ONNX deep learning quantization

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/27/2026

AMD Hipfire - a new inference engine optimized for AMD GPU's

The post introduces Hipfire, a new inference engine optimized for AMD GPUs that uses a special mq4 quantization method. Initial benchmarks from Localmaxxing show dramatic speedups, although the creator clarifies that the project is not officially affiliated with AMD.

Benchmarking GPU optimization AMD quantization

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/15/2026

What is the current status with Turbo Quant?

The poster asks about the current status of "Turbo Quant", noting the hype around it roughly two weeks earlier and mentions of pull requests into llama.cpp, and seeks an update on its development and adoption.

Turbo Quant llama.cpp quantization AI development

RESEARCH · arXiv CS.LG · 1d ago

FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models

Diffusion Large Language Models (dLLMs) face a "stability lag" due to irreversible token commitment, a problem exacerbated by Post-Training Quantization (PTQ) errors. FAIR-Calib proposes a two-stage PTQ framework that uses a position prior and layer-wise calibration to protect fragile frontier states, enhancing quantization for dLLMs.

Diffusion Models post-training quantization quantization AI calibration

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/15/2026

Major drop in intelligence across most major models.

The author reports a major drop in intelligence across several major AI models, including ChatGPT, Claude, Gemini, and Grok, as of mid-April 2026. They observed models ignoring instructions and giving shallow outputs, hypothesize that the cause is either more aggressive quantization or a deliberate policy, and suggest rented GPUs or local AI as alternatives.

quantization Local AI model degradation AI intelligence drop

ARTICLE · DEV.to AI · 4/19/2026

The Rise of Inference Optimization: The Real LLM Infra Trend Shaping 2026

The content highlights inference optimization as the critical trend shaping LLM infrastructure by 2026, emphasizing its importance over model size. It explains that while training is a one-time cost, inference is an ongoing expense that directly impacts margins and user experience, making efficiency paramount.

quantization AI infrastructure Inference Optimization Cost Efficiency

RESEARCH · arXiv CS.LG · 29d ago

RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory

This paper introduces RateQuant, a method for optimal mixed-precision KV cache quantization in large language models to address memory bottlenecks. It tackles the challenge of distortion model mismatch, where applying one quantizer's distortion model to another degrades performance compared to uniform quantization.

Memory Optimization quantization AI Research LLM

RESEARCH · arXiv CS.LG · 5/6/2026

eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization

eOptShrinkQ is a two-stage compression pipeline for KV cache in transformer attention heads. It leverages optimal singular value shrinkage and per-vector scalar quantization, grounded in random matrix theory, to achieve near-lossless compression and improve reconstruction.

quantization Random matrix theory AI compression KV cache

RESEARCH · arXiv CS.LG · 5d ago

LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection

LiftQuant is a novel framework for continuous bit-width control in Large Language Models, addressing limitations of integer-based quantization. It employs a "lift-then-project" mechanism to achieve quasi-continuous bit-width tuning for optimal deployment.

Model Compression neural networks LLMs deep learning

RESEARCH · arXiv CS.LG · 4/8/2026

Prune-Quantize-Distill: An Ordered Pipeline for Efficient Neural Network Compression

This paper proposes an ordered pipeline (pruning, INT8 quantization, and knowledge distillation) for neural network compression that targets measured inference latency rather than proxy metrics. The study finds that INT8 quantization provides the main runtime benefit, while pruning acts as a preconditioner and knowledge distillation recovers accuracy.

Pruning Knowledge Distillation model efficiency Neural Network Compression

DOC · DEV.to AI · 10d ago

How to Deploy Qwen2.5 72B with vLLM + AWQ Quantization on a $24/Month DigitalOcean GPU Droplet: Multilingual Reasoning at 1/110th Claude Opus Cost

This guide details how to deploy Qwen2.5 72B with vLLM and AWQ quantization on a DigitalOcean GPU Droplet for just $24/month. It demonstrates significant cost reduction compared to commercial AI APIs like Claude Opus, offering enterprise-grade multilingual reasoning at a fraction of the price.

deployment quantization Cost Optimization DigitalOcean

RESEARCH · arXiv CS.CL · 19d ago

Improving Quantized Model Performance in Qualitative Analysis with Multi-Pass Prompt Verification

This research examines how various lower-bit quantization levels impact LLaMA-3.1's performance in qualitative analysis, noting that low-bit models often produce hallucinations. It proposes a quantization-aware multi-pass prompt verification method to enhance accuracy by systematically reducing hallucinations and filtering unreliable content.

model performance Qualitative Analysis LLMs hallucinations

RESEARCH · DEV.to AI · 28d ago

Federated Learning With Quantized Global Model Updates

This content explores the technique of federated learning, specifically focusing on how quantized global model updates can optimize its efficiency. It likely delves into methods for reducing communication overhead and computational costs in distributed machine learning environments.

Model updates machine learning quantization federated learning

ARTICLE · DEV.to AI · 15d ago

Most people starting with local LLMs jump straight to 4-bit quantization because it's fast and uses

This article compares 16-bit, 8-bit, and 4-bit LLM quantization and finds that 4-bit, while faster, significantly compromises quality on reasoning and math tasks. It frames the real trade-off as one between the task and the precision it requires: 8-bit suits precision-demanding tasks, with minimal quality loss and only a slight speed reduction. Quantization choice should therefore depend on the task as well as the hardware, not on hardware alone.

inference speed model performance quantization hardware
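
The hardware side of the trade-off described above comes down to simple arithmetic: weight memory is roughly parameter count times bits per weight. A small sketch, assuming a hypothetical 7B-parameter model and counting weights only (real quantized formats also store scales and other metadata, so effective bits per weight are somewhat higher, and the KV cache and activations need memory too):

```python
def weight_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return n_params * bits_per_weight / 8 / 1e9


# Hypothetical model size, used only to illustrate the arithmetic.
N_PARAMS = 7_000_000_000

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_gb(N_PARAMS, bits):.1f} GB")
```

Halving the bit width halves the weight footprint, which is why 4-bit is the default choice on small GPUs even when the task would benefit from 8-bit.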

RESEARCH · arXiv CS.LG · 7d ago

BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization

BitsMoE proposes a spectral-energy-guided bit-allocation framework for quantizing Mixture-of-Experts (MoE) large language models. It addresses memory-intensive deployment by decomposing MoE layers and using expert-specific spectral factors for fine-grained, activation-aware mixed-precision quantization.

MoE models deep learning AI optimization quantization