quantization

57 items

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/14/2026

Gemma 4 31B — 4bit is all you need

A comparison of the 4-bit and 8-bit quantized versions of Gemma 4 31B on an M5 Max MacBook Pro, in which the 4-bit version unexpectedly scored higher (91.3% vs. 88.4%). The post also reports that Gemma 4 26B-A4B entered a regression loop, so its responses were truncated at the maximum token limit of 16,384.

4bit 8bit Gemma quantization

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/23/2026

Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]

The user is optimizing a Transformer model for size and inference speed, having plateaued after FP16 conversion and ONNX optimization, with pruning yielding limited gains. They are seeking advice on advanced techniques like low-rank factorization, aggressive quantization (INT8/INT4), knowledge distillation, or hardware-specific optimizations to achieve further real-world improvements.

Pruning inference Transformer quantization

NEWS · ↑ trending · Reddit r/MachineLearning · 4/21/2026

We open-sourced Chaperone-Thinking-LQ-1.0 — a 4-bit GPTQ + QLoRA fine-tuned DeepSeek-R1-32B that hits 84% on MedQA in ~20GB [N]

Chaperone-Thinking-LQ-1.0, a 4-bit GPTQ + QLoRA fine-tuned DeepSeek-R1-32B model, has been open-sourced. It achieves 84% accuracy on MedQA, close to GPT-4o, while being only ~20GB in size and 1.6x faster than the base model.

Open Source Benchmarking quantization Fine-tuning

NEWS · ↑ trending · Reddit r/MachineLearning · 4/22/2026

INT3 compression+fused metal kernels [R]

A solo founder developed INT3 model compression and a 2-bit KV cache with custom fused Metal kernels for Mac (M-series). Qwen 7B is available in preview, and further optimizations and GPU support are planned.

Hardware Acceleration LLMs quantization model optimization

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/12/2026

ArcFace embeddings quantized to 16-bit pgvector HALFVEC ? [D]

The post discusses storing 512-dimension ArcFace embeddings in PostgreSQL, where the full-precision vectors exceed TOAST limits and increase I/O. It proposes quantizing them to 16-bit (pgvector HALFVEC) to halve storage and I/O, and asks how much precision would be lost.

quantization pgvector embeddings PostgreSQL
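
The storage arithmetic behind this item can be sketched in a few lines. This is a minimal illustration, not the poster's setup: it uses NumPy `float16` as a stand-in for the 2-byte element width of HALFVEC, and random unit vectors as hypothetical substitutes for real ArcFace embeddings. It counts payload bytes only and ignores per-value headers.

```python
import numpy as np

DIM = 512  # ArcFace embedding dimension mentioned in the post

rng = np.random.default_rng(0)
# Hypothetical data: random unit vectors standing in for real embeddings.
emb = rng.standard_normal((1000, DIM)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Cast to 16-bit floats (assumption: same element width as HALFVEC).
half = emb.astype(np.float16)

bytes_fp32 = emb[0].nbytes   # 512 * 4 bytes of payload per vector
bytes_fp16 = half[0].nbytes  # 512 * 2 bytes of payload per vector

# Compare similarity scores of one query against all vectors,
# accumulating in float64 so only the storage precision differs.
sims_fp32 = emb.astype(np.float64) @ emb[0].astype(np.float64)
sims_fp16 = half.astype(np.float64) @ half[0].astype(np.float64)
max_abs_err = float(np.max(np.abs(sims_fp32 - sims_fp16)))

print(bytes_fp32, bytes_fp16, max_abs_err)
```

On random data the similarity error is small, but whether it is acceptable for face matching depends on the decision threshold and on real embeddings, so a check like this would need to be repeated on the actual dataset.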

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/22/2026

Personal Eval follow-up: Gemma4 26B MoE (Q8) vs Qwen3.5 27B Dense vs Gemma4 31B Dense Compared

This follow-up compares Gemma4 26B MoE (Q8), Qwen3.5 27B Dense, and Gemma4 31B Dense models, including previous Qwen 3.6 35B and Gemma 4 26B (Q4) results. The analysis benchmarks their performance, highlighting the impact of 8-bit quantization and the effectiveness of different model architectures.

Benchmarking Gemma model comparison quantization

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

Qwen3.6-35B-A3B-Uncensored-Wasserstein-GGUF

A user discovered and fixed a significant tensor drift issue in the `ssm_conv1d` layers of quantized Qwen3.6-35B GGUF models, and proposes the Wasserstein metric as superior to Kullback-Leibler divergence for detecting numerical instability. The fix, which specifically targets the recurrent state transition layers responsible for long-context memory, is now available in a shared model.

LLMs quantization GGUF model optimization

DOC · ↑ trending · Reddit r/LocalLLaMA · 5/6/2026

Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR

The post details an implementation of Multi-Token Prediction (MTP) with quantized GGUFs for Qwen3.6-27B, using Unsloth's UD XL quantizations with Q8_0 MTP layers grafted on top, which yields a 2.5x throughput increase. The author shares the grafted GGUF files, the raw MTP layer source, and a conversion script, along with instructions for a custom llama.cpp build that incorporates speculative decoding support from an unmerged PR.

Multi-Token Prediction llama.cpp quantization large language models

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/13/2026

Experiment: Olmo 3 7B Instruct Q1_0

The author attempted to quantize OLMo-3 7B Instruct into a 1-bit format using quantization-aware distillation, training the model for 12 hours on 4x B200 GPUs. Although the resulting model can produce basic English, it is generally unusable because of repetition loops and a lack of context tracking, which the author attributes to stopping training too early and an unsuitable choice of dataset.

OLMo-3 distillation quantization 1-bit model

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 26d ago

A First Comprehensive Study of TurboQuant: Accuracy and Performance

A comprehensive study on TurboQuant compares its variants (k8v4, 4bit-nc, k3v4-nc, 3bit-nc) with FP8 for KV-cache quantization. FP8 is recommended as the default, offering 2x capacity with negligible accuracy loss and good performance. TurboQuant variants show limited advantages or significant degradation in accuracy and performance, with 4bit-nc being an option for memory-constrained scenarios.

AI models TurboQuant Performance optimization FP8
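
The "2x capacity" figure for FP8 follows from element width alone. The sketch below assumes the standard KV-cache size formula (one K and one V tensor per layer) and a hypothetical model shape and memory budget that are not taken from the study. It also ignores any scale factors or metadata a real quantized cache might store.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem):
    """Approximate KV-cache size; the factor 2 covers the K and V tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Hypothetical model shape and memory budget, chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)
budget = 8 * 1024**3  # 8 GiB reserved for the KV cache

per_token_fp16 = kv_cache_bytes(**cfg, n_tokens=1, bytes_per_elem=2)
per_token_fp8 = kv_cache_bytes(**cfg, n_tokens=1, bytes_per_elem=1)

tokens_fp16 = budget // per_token_fp16
tokens_fp8 = budget // per_token_fp8

print(tokens_fp16, tokens_fp8)
```

Halving the bytes per element doubles the number of cached tokens that fit in the same budget, which is the capacity gain the summary refers to.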

DOC · ↑ trending · Reddit r/MachineLearning · 4/22/2026

Need Info on quality benchmarks to run on DeepSeek V3.2 different quant levels [D]

A user is seeking advice on what quality benchmarks to run to measure the performance degradation when applying runtime quantization to the DeepSeek V3.2 large language model. The goal is to compare the quality loss against the non-quantized version.

Benchmarking quantization model optimization AI evaluation

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/10/2026

Update on Gemma 4 having MTP: Reverse engineering effort

The author extracted the Gemma 4 model's weights into TFLite files and is now asking the community, especially C++ experts, for help converting them into a PyTorch module. The process involves challenges such as INT8 dequantization and exploring tools such as the Google AI Edge Model Explorer.

Gemma 4 machine learning quantization model conversion

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/14/2026

MiniMax M2.7 GGUF Investigation, Fixes, Benchmarks

An investigation into MiniMax-M2.7 GGUF revealed that perplexity NaNs affect 21-38% of GGUFs on Hugging Face. The issue was traced to an overflow in llama.cpp, specifically in `blk.61.ffn_down_exps` for Q5_K and Q4_K quantizations, and the team has fixed its own GGUFs.

Perplexity NaNs quantization GGUF

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 5/7/2026

ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

ParoQuant is a novel technique that employs pairwise rotation quantization to significantly improve the efficiency of Large Language Model (LLM) inference. This method specifically targets reasoning LLMs, enabling more cost-effective and faster deployment by reducing computational and memory requirements.

Optimization LLMs efficiency quantization

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 25d ago

Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct

The author investigates why a specific Qwen3.6 27B INT8 Autoround quantization recipe outperforms others, observing that the model "thinks" less but gives better outputs in benchmarks. They then replicated this behavior with a new GGUF quant, noting that both consistently reach answers faster than UD Q8 K XL.

AI models Qwen3.6 Performance optimization quantization

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/21/2026

PrismML — Introducing Ternary Bonsai: Top Intelligence at 1.58 Bits

This content introduces PrismML and a new AI concept called Ternary Bonsai, claiming to achieve top intelligence with remarkable efficiency at 1.58 bits. It likely discusses advancements in AI model compression or optimized performance.

AI models model efficiency machine learning quantization

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 18d ago

[llama.cpp] Asymmetric KV q8/q4 cache: current caveats and discussion in GGML repo

The post addresses a caveat in llama.cpp with asymmetric KV q8/q4 cache quantization, which can cause processing to run on the CPU when using CUDA. A GitHub discussion points to a solution that involves compiling with a specific KV cache quant combo, offering substantial memory savings with only a 1.3% precision loss.

llama.cpp GPU optimization quantization KV cache

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/11/2026

What if your HNSW index stored 3-bit embeddings instead of float32? [R]

The post explores an experimental approach to HNSW vector indexing that uses 3-bit quantized embeddings instead of float32 to reduce memory usage. The technique, based on PolarQuant, allows efficient distance computation via precomputed tables, yielding memory savings and good recall, at the cost of a slower build process and challenges with quantization noise.

HNSW Memory Optimization quantization Vector Indexing

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/17/2026

Qwen3.6-35B-A3B Uncensored Aggressive is out with K_P quants!

The Qwen3.6-35B-A3B "Aggressive" variant has been released, offering an uncensored version of the original model with no refusals and zero capability loss. This release includes various K_P quants and vision support.

uncensored AI quantization Qwen model release

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/14/2026

Updated Qwen3.5-9B Quantization Comparison

This content compares various GGUF quantizations of the Qwen3.5-9B model using KL Divergence (KLD) to assess faithfulness to the BF16 baseline. The goal is to provide users with a data-driven basis for selecting the most faithful quantized file, where lower KLD scores indicate less information loss.

Qwen3.5-9B KLD quantization GGUF
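
For readers unfamiliar with the metric, a KLD comparison of this kind can be sketched as follows. This is an illustrative outline, not the post's pipeline: the logits are random placeholders, the noise levels are hypothetical stand-ins for mild and aggressive quantization, and the direction KL(baseline || quantized) is an assumption.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kld(base_logits, quant_logits):
    """Mean KL(P_base || P_quant) over token positions."""
    log_p = log_softmax(np.asarray(base_logits, dtype=np.float64))
    log_q = log_softmax(np.asarray(quant_logits, dtype=np.float64))
    per_token = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(per_token.mean())

rng = np.random.default_rng(0)
# Placeholder logits: 16 token positions over a 1000-entry vocabulary.
base = rng.standard_normal((16, 1000))
# Hypothetical "quantized" outputs: baseline plus small or large noise.
mild = base + 0.01 * rng.standard_normal(base.shape)
harsh = base + 0.30 * rng.standard_normal(base.shape)

kld_same = mean_kld(base, base)
kld_mild = mean_kld(base, mild)
kld_harsh = mean_kld(base, harsh)
print(kld_same, kld_mild, kld_harsh)
```

A quant whose output distribution matches the baseline exactly scores zero, and the score grows as the distributions diverge, which is why lower KLD is read as higher faithfulness.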
