GGUF

16 items

RESEARCH↑ trendingReddit r/LocalLLaMA·4/18/2026

Qwen3.6-35B-A3B-Uncensored-Wasserstein-GGUF

A user discovered and fixed significant tensor drift in the `ssm_conv1d` layers of quantized Qwen3.6-35B GGUF models and proposes the Wasserstein metric as better than Kullback-Leibler divergence for detecting numerical instability. The fix targets the recurrent state-transition layers responsible for long-context memory and is now available in a shared model.

LLMs quantization GGUF model optimization
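
As a rough illustration of the metric named in the summary above (not the post author's actual procedure, which is not reproduced here), the Wasserstein-1 distance between two equal-sized one-dimensional samples is the mean absolute difference of their sorted values. The function name and the sample values below are hypothetical. One reason the metric can be attractive for drift checks is that it stays finite and grows with the size of a shift, whereas KL divergence between histograms becomes infinite when one histogram has an empty bin where the other has mass.

```python
def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-sized 1-D samples."""
    if len(a) != len(b) or not a:
        raise ValueError("this sketch assumes non-empty, equal-sized samples")
    # For equal-sized empirical samples, the optimal transport plan pairs
    # the i-th smallest value of one sample with the i-th smallest of the other.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

# Hypothetical values standing in for a reference tensor and a drifted copy.
reference = [0.0, 1.0, 2.0, 3.0]
drifted = [x + 0.5 for x in reference]

distance = wasserstein_1d(reference, drifted)
print(distance)
```

A uniform shift of every value by 0.5 gives a distance of 0.5, so the number reads directly as "how far the values moved on average".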

DOC↑ trendingReddit r/LocalLLaMA·5/6/2026

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints

This content details how to achieve 2.5x faster inference with Qwen 3.6 27B using MTP support in llama.cpp, enabling 28 tok/s on an M2 Max. It provides converted GGUF files for download, suitable for local agentic coding with 262k context on 48GB.

LLM optimization llama.cpp GGUF Qwen

ARTICLE↑ trendingReddit r/LocalLLaMA·4/14/2026

MiniMax M2.7 GGUF Investigation, Fixes, Benchmarks

An investigation into MiniMax-M2.7 GGUFs found that perplexity NaNs affect 21-38% of the GGUFs on Hugging Face. The issue was traced to an overflow in llama.cpp, specifically in `blk.61.ffn_down_exps` for the Q5_K and Q4_K quantizations, and the team has fixed its own uploads.

Perplexity NaNs quantization GGUF
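
To show why a single overflow can turn a whole perplexity score into NaN, here is a minimal sketch. It assumes, purely for illustration, that the overflow happens in a half-precision (fp16) intermediate; the summary above does not say which data type is involved, and `saturate_fp16` is a hypothetical stand-in, not a llama.cpp function. All numbers are made up.

```python
import math

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

def saturate_fp16(x):
    """Crude stand-in for an fp16 store: out-of-range magnitudes become +/-inf."""
    if abs(x) > FP16_MAX:
        return math.copysign(math.inf, x)
    return x

def perplexity(nlls):
    """Perplexity as exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(nlls) / len(nlls))

# Two hypothetical partial sums that both exceed the fp16 range.
pos = saturate_fp16(7.0e4)   # becomes +inf
neg = saturate_fp16(-7.0e4)  # becomes -inf
poisoned = pos + neg         # inf + (-inf) is NaN

# One NaN among otherwise ordinary per-token losses poisons the mean.
ppl = perplexity([2.0, 2.1, poisoned])
print(ppl)
```

Because NaN propagates through every sum and `exp`, one bad token is enough to make the reported perplexity NaN, which is why such runs are easy to spot but hard to localize without checking individual tensors.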

DOC↑ trendingReddit r/LocalLLaMA·5/6/2026

Get faster qwen 3.6 27b

The content details how to achieve faster performance with the Qwen 3.6 27B model using llama.cpp on a 3090 GPU. It includes steps to apply a specific commit and `llama-server` setup commands to reach 50 t/s with 100k context.

llama.cpp AI optimization GPU performance GGUF

RESEARCH↑ trendingReddit r/LocalLLaMA·4/14/2026

Updated Qwen3.5-9B Quantization Comparison

This post compares various GGUF quantizations of the Qwen3.5-9B model using KL divergence (KLD) to assess faithfulness to the BF16 baseline. The goal is to give users a data-driven basis for choosing the most faithful quantized file; lower KLD scores indicate less information loss.

Qwen3.5-9B KLD quantization GGUF
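
A minimal sketch of how a KLD score of this kind can be computed, assuming the comparison is made between next-token distributions of the baseline and the quantized model at the same positions. The logits below are hypothetical and the helper names are illustrative; this is not the tooling used in the post.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; p is the baseline, q the quantized model."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over a 4-token vocabulary at two positions.
baseline_logits = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.4, 3.0, 0.0]]
quantized_logits = [[1.9, 1.1, 0.0, -0.8], [0.6, 0.2, 2.7, 0.1]]

per_token = [
    kl_divergence(softmax(b), softmax(q))
    for b, q in zip(baseline_logits, quantized_logits)
]
mean_kld = sum(per_token) / len(per_token)
print(mean_kld)
```

A score of exactly zero means the quantized model reproduces the baseline distribution at every position; larger values mean the distributions diverge more.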

ARTICLE↑ trendingReddit r/LocalLLaMA·4/12/2026

MiniMax-M2.7 vs Qwen3.5-122B-A10B for 96GB VRAM full offload?!

The author compares MiniMax-M2.7 and Qwen3.5-122B-A10B GGUF models for full offload on a local 96GB VRAM rig. For their purposes, Qwen3.5-122B is preferred, despite MiniMax being more heavily quantized, which highlights the performance trade-offs in local LLM inference.

VRAM GGUF MiniMax Qwen

ARTICLE↑ trendingReddit r/LocalLLaMA·4/8/2026

Qwen3.5-35B-A3B-Uncensored-FernflowerAI-GGUF

The author found and fixed a training bug in the Qwen3.5-35B-A3B model and released a fixed version, an improved system prompt, a chat template with tool-calling support, and recommended settings for LM Studio. The fix addresses the context loss and repetition that occurred in long conversations with the previous version of the model.

Model Fix Qwen3.5 GGUF Uncensored

NEWS↑ trendingReddit r/LocalLLaMA·4/12/2026

mtmd: qwen3 audio support (qwen3-omni and qwen3-asr)

Qwen3 audio input is now supported through the `qwen3-omni-moe` (multimodal, with vision and audio input) and `qwen3-asr` (automatic speech recognition) versions. GGUF models for Qwen3-Omni (30B variants) and Qwen3-ASR (1.7B and 0.6B) are available on Hugging Face for community use.

multimodal AI audio GGUF Qwen3

ARTICLE↑ trendingReddit r/LocalLLaMA·5/6/2026

Uploaded Unsloth Qwen3.6-35B-A3B UD XL models with MTP grafted, here are the results

This post reports results for the Unsloth Qwen3.6-35B-A3B UD XL models with MTP grafted, now available on Hugging Face. Initial tests showed limited speed gains (6% for Q4, 2.5% for Q8) on some setups, though other users reported larger improvements (up to 50%) depending on their hardware.

AI models LLM optimization GGUF performance testing

ARTICLE↑ trendingReddit r/LocalLLaMA·4/26/2026

Switched from Qwen3.6 35b-a3b to Qwen3.6 27b mid coding and it's noticeably better!

A user switched from Qwen3.6 35b-a3b to Qwen3.6 27b (IQ3_M) mid-coding and found the latter noticeably better, even solving a difficult bug. They question if dense models handle compression better than MoE models, given the positive experience with a more aggressive quantization.

AI models local LLM Performance Comparison GGUF

ARTICLE↑ trendingReddit r/LocalLLaMA·4/19/2026

Gemma 4 - MLX doesn't seem better than GGUF

A user compares the performance of the Gemma 4-26b-a4b model in MLX and GGUF versions on an M1 Max with 32GB RAM. Tests with a 3k token prompt indicate that GGUF is slightly faster in both prompt processing and tokens per second.

model performance apple-silicon Gemma MLX

DOC↑ trendingReddit r/LocalLLaMA·5/4/2026

it's time to update your Gemma 4 GGUFs

It's time to update your Gemma 4 GGUF models as the chat template was fixed a few days ago. Several links for downloading the updated models are provided.

AI models LLMs update Gemma

NEWS↑ trendingReddit r/LocalLLaMA·4/8/2026

It looks like we’ll need to download the new Gemma 4 GGUFs

This post announces updated Gemma 4 GGUF models from Unsloth, incorporating several improvements and fixes from the llama.cpp project. The updates cover technical aspects such as the KV cache, CUDA support, vocabulary handling, and Gemma 4-specific parsing.

unsloth Gemma 4 AI models llama.cpp

NEWS↑ trendingReddit r/LocalLLaMA·4/22/2026

unsloth Qwen3.6-27B-GGUF

The long-awaited GGUF files for the unsloth Qwen3.6-27B model are now available.

unsloth GGUF model release LLM

DOC DEV.to AI·5/10/2026

How to Deploy Llama 3.2 11B with GGUF Quantization on a $5/Month DigitalOcean Droplet: Production Inference Without GPU Costs

This article details how to deploy the Llama 3.2 11B model with GGUF quantization on a low-cost DigitalOcean Droplet for production inference. It demonstrates significant cost savings compared to paid AI APIs, while maintaining good performance on CPUs.

learning Llama 3 AI deployment Cost Optimization

NEWS↑ trendingReddit r/LocalLLaMA·4/8/2026

kepler-452b. GGUF when?

The title asks about the availability of the GGUF format for 'kepler-452b', suggesting a discussion about the GGUF version of an AI model. The entry is a simple community post with links to more details.

GGUF model deployment LLM