quantization

57 items

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/14/2026

Gemma 4 31B — 4bit is all you need

The post compares 4-bit and 8-bit quantizations of Gemma 4 31B on an M5 Max MacBook Pro and, surprisingly, finds that the 4-bit version scored higher (91.3% vs 88.4%). It also notes that Gemma 4 26B-A4B entered a regression loop, so its responses were truncated at the 16,384 max-token limit.

67
ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/23/2026

Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]

The user is optimizing a Transformer model for size and inference speed, having plateaued after FP16 conversion and ONNX optimization, with pruning yielding limited gains. They are seeking advice on advanced techniques like low-rank factorization, aggressive quantization (INT8/INT4), knowledge distillation, or hardware-specific optimizations to achieve further real-world improvements.

50
RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

Qwen3.6-35B-A3B-Uncensored-Wasserstein-GGUF

A user discovered and fixed a significant tensor drift issue in the `ssm_conv1d` layers of quantized Qwen3.6-35B GGUF models, and proposes the Wasserstein metric as superior to Kullback-Leibler divergence for detecting numerical instability. The fix, which specifically targets the recurrent state transition layers responsible for long-context memory, is now available in a shared model.

44
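
To make the Wasserstein idea concrete, here is a minimal sketch of a drift check on one tensor. It is not the poster's method: it assumes the metric is applied to the flattened values of a tensor before and after quantization, treated as two equal-sized samples, and the function name is hypothetical. In one dimension the Wasserstein-1 distance reduces to the mean absolute difference between sorted values, so it reports how far the values moved, whereas a KL estimate on histograms depends on binning and blows up when a bin is empty on one side.

```python
def wasserstein_1d(reference, candidate):
    """1-D Wasserstein-1 distance between two equal-sized samples.

    Hypothetical helper: `reference` holds the original tensor values,
    `candidate` the dequantized values, both flattened to plain lists.
    """
    if len(reference) != len(candidate):
        raise ValueError("this sketch assumes equal sample counts")
    # Sorting pairs up the i-th smallest values, which is the optimal
    # transport plan in one dimension.
    pairs = zip(sorted(reference), sorted(candidate))
    return sum(abs(r - c) for r, c in pairs) / len(reference)


if __name__ == "__main__":
    original = [-0.5, 0.0, 0.25, 1.0]
    drifted = [x + 0.5 for x in original]      # every value shifted by 0.5
    print(wasserstein_1d(original, original))  # no drift
    print(wasserstein_1d(original, drifted))   # grows with the shift
```

A per-tensor threshold on this value would still have to be chosen empirically; the post does not state one.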
DOC · ↑ trending · Reddit r/LocalLLaMA · 5/6/2026

Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR

The post details an implementation of Multi-Token Prediction (MTP) with quantized GGUFs for Qwen3.6-27B, using Unsloth's UD XL quantizations with Q8_0 MTP layers grafted on top, which yields a 2.5x throughput increase. The author shares grafted GGUF files, the raw MTP layer source, and a conversion script, along with custom llama.cpp build instructions that incorporate speculative decoding support from an unmerged PR.

43
ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/13/2026

Experiment: Olmo 3 7B Instruct Q1_0

The author attempted to quantize OLMo-3 7B Instruct into a 1-bit format using quantization-aware distillation, training the model for 12 hours on 4x B200 GPUs. Although the resulting model can produce basic English, it is generally unusable because of repetition loops and poor context tracking, which the author attributes to stopping training too early and to an unsuitable dataset.

43
RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 26d ago

A First Comprehensive Study of TurboQuant: Accuracy and Performance

A comprehensive study on TurboQuant compares its variants (k8v4, 4bit-nc, k3v4-nc, 3bit-nc) with FP8 for KV-cache quantization. FP8 is recommended as the default, offering 2x capacity with negligible accuracy loss and good performance. TurboQuant variants show limited advantages or significant degradation in accuracy and performance, with 4bit-nc being an option for memory-constrained scenarios.

43
ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 25d ago

Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct

The author investigates why a specific Qwen3.6 27B INT8 Autoround quantization recipe outperforms others, observing that the model "thinks" less yet gives better outputs in benchmarks. They then replicated this behavior with a new GGUF quant and note that both consistently reach answers faster than UD Q8 K XL.

42
RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/11/2026

What if your HNSW index stored 3-bit embeddings instead of float32? [R]

The post explores an experimental approach to HNSW vector indexing that stores 3-bit quantized embeddings instead of float32 to reduce memory use. The technique, based on PolarQuant, enables efficient distance computation via precomputed tables, yielding memory savings and good recall, despite a slower build process and challenges with quantization noise.

42
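
As a rough illustration of distance computation through precomputed tables, the sketch below quantizes each coordinate to one of 8 levels (3 bits) and sums squared differences looked up from an 8x8 table. It deliberately uses plain uniform scalar quantization as a stand-in and is not PolarQuant; the value range of [-1, 1], the function names, and the unpacked integer codes are all assumptions made for readability.

```python
LEVELS = 8  # 3 bits per coordinate


def quantize(vec, lo=-1.0, hi=1.0):
    """Map each float to a code in 0..7 on a uniform grid over [lo, hi]."""
    step = (hi - lo) / (LEVELS - 1)
    return [min(LEVELS - 1, max(0, round((x - lo) / step))) for x in vec]


def build_table(lo=-1.0, hi=1.0):
    """Precompute squared differences between every pair of grid centers."""
    step = (hi - lo) / (LEVELS - 1)
    centers = [lo + i * step for i in range(LEVELS)]
    return [[(a - b) ** 2 for b in centers] for a in centers]


def sq_distance(codes_a, codes_b, table):
    """Approximate squared L2 distance using only table lookups."""
    return sum(table[a][b] for a, b in zip(codes_a, codes_b))


if __name__ == "__main__":
    table = build_table()
    a = quantize([0.9, -0.3, 0.1, 0.7])
    b = quantize([0.8, -0.2, 0.0, -0.6])
    print(a, b, sq_distance(a, b, table))
```

A real index would pack the codes into bits (3 bits versus 32 bits per coordinate, roughly a 10x reduction) and would use the table during graph traversal; both are left out here.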
RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/14/2026

Updated Qwen3.5-9B Quantization Comparison

The post compares various GGUF quantizations of the Qwen3.5-9B model, using KL divergence (KLD) to assess faithfulness to the BF16 baseline. The goal is to give users a data-driven basis for choosing the most faithful quantized file; a lower KLD score indicates less information loss.

42
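
For readers unfamiliar with the metric, a minimal sketch of a per-token KLD comparison follows. It assumes you already have next-token logits from the BF16 baseline and from a quantized model at the same positions; the function names and the tiny three-entry vocabularies are illustrative only and do not come from the post.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(baseline_logits, quant_logits):
    """KL(P || Q) in nats at one position; P = baseline, Q = quantized."""
    log_p = log_softmax(baseline_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kld(baseline, quantized):
    """Average the per-position KL over a sequence of logit vectors."""
    total = sum(kl_divergence(p, q) for p, q in zip(baseline, quantized))
    return total / len(baseline)


if __name__ == "__main__":
    baseline = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
    mild = [[2.0, 1.2, 0.1], [0.4, 0.5, 3.0]]
    harsh = [[2.0, 2.0, 0.1], [0.0, 0.5, 3.0]]
    print(mean_kld(baseline, mild), mean_kld(baseline, harsh))
```

The value is zero when the two models assign identical probabilities and grows as the quantized distribution departs from the baseline, which is why lower is read as more faithful.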