
llama.cpp

33 items

DOC · ↑ trending · Reddit r/LocalLLaMA · 5/6/2026

Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR

Describes running Multi-Token Prediction (MTP) with quantized GGUFs for Qwen3.6-27B: Unsloth's UD XL quantizations with Q8_0 MTP layers grafted on top, for a reported 2.5x throughput increase. The author shares the grafted GGUF files, the raw MTP layer source, and a conversion script, along with build instructions for a custom llama.cpp that includes speculative decoding support from an unmerged PR.

43
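
For readers wondering where a figure like 2.5x can come from, here is a minimal back-of-the-envelope sketch of speculative decoding arithmetic. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and the names `accept_rate`, `draft_len`, and `draft_cost` are hypothetical parameters chosen for illustration. Nothing here is taken from the post or from llama.cpp.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes every drafted token is accepted independently with probability
    accept_rate (a simplification). The pass always yields at least one token.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        # Every draft token accepted, plus the one token the target adds.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int, draft_cost: float = 0.0) -> float:
    """Rough speedup over plain decoding.

    draft_cost is the (hypothetical) cost of drafting one token, expressed as
    a fraction of one target-model pass.
    """
    tokens = expected_tokens_per_pass(accept_rate, draft_len)
    return tokens / (1.0 + draft_len * draft_cost)


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the post.
    print(estimated_speedup(accept_rate=0.8, draft_len=3, draft_cost=0.05))
```

The point of the sketch is that the gain depends on both the acceptance rate and how cheap the draft is, so the same model can show different speedups on different workloads.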
DOC · ↑ trending · Reddit r/LocalLLaMA · 5/6/2026

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints

Explains how to get 2.5x faster inference from Qwen 3.6 27B using MTP support in llama.cpp, reaching 28 tok/s on an M2 Max. Converted GGUF files are provided for download, aimed at local agentic coding with 262k context on 48GB.

43
ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 5/7/2026

why llama.cpp can’t combine speculative decode methods?

A user asks why speculative decoding methods such as MTP and N-gram cannot be used together in llama.cpp, noting that N-gram drafting gives significant improvements for agentic coding. They want to know whether this is a fundamental limitation or an implementation one, and note that others have already asked the same question.

43
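
As background on the N-gram method mentioned above, the following is a minimal sketch of n-gram (prompt-lookup style) drafting: the last few tokens are matched against earlier context, and whatever followed the match is proposed as the draft. This is a generic illustration and not llama.cpp's implementation. The function name and the `n` and `max_draft` parameters are hypothetical.

```python
from typing import List


def ngram_draft(tokens: List[int], n: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by reusing what followed the most recent earlier
    occurrence of the last n tokens. Returns [] when there is no match."""
    if n <= 0 or len(tokens) <= n:
        return []
    key = tokens[-n:]
    # Scan backwards, skipping the trailing occurrence of the key itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == key:
            # Tokens that followed the earlier match become the draft.
            return tokens[start + n:start + n + max_draft]
    return []


if __name__ == "__main__":
    history = [1, 2, 3, 4, 5, 1, 2, 3]
    print(ngram_draft(history, n=3, max_draft=2))
```

Drafting this way needs no extra model, which is why it tends to help on repetitive text such as code edits. The target model still has to verify every proposed token.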
ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/15/2026

Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU+GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload

This article details a novel dynamic expert caching strategy implemented in llama.cpp to accelerate token generation for large MoE models like Qwen3.5-122B-A10B. The approach loads frequently routed experts into VRAM, leading to up to 26.8% faster token generation compared to layer-based partial offload.

43
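
The idea of keeping frequently routed experts in VRAM can be sketched as a small frequency-based cache. This is only an illustration of the general technique under simple assumptions (a fixed number of expert slots, counts never decayed). The class `HotExpertCache` and its methods are hypothetical and do not reflect the code described in the article.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


class HotExpertCache:
    """Tracks how often each expert is routed to and keeps the most
    frequently used ones 'resident' (standing in for VRAM)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity          # number of expert slots available
        self.hits: Counter = Counter()    # routing count per expert id
        self.resident: Set[int] = set()   # experts currently cached

    def record(self, expert_ids: Iterable[int]) -> None:
        """Record the experts chosen by the router for recent tokens."""
        self.hits.update(expert_ids)

    def rebalance(self) -> Tuple[Set[int], Set[int]]:
        """Recompute the hot set; return (experts to load, experts to evict)."""
        hot = {e for e, _ in self.hits.most_common(self.capacity)}
        to_load = hot - self.resident
        to_evict = self.resident - hot
        self.resident = hot
        return to_load, to_evict


if __name__ == "__main__":
    cache = HotExpertCache(capacity=2)
    cache.record([0, 1, 1, 2, 2, 2])
    print(cache.rebalance())
```

A real implementation would also have to weigh the cost of moving weights between host memory and VRAM against the expected benefit, which this sketch ignores.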
ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part.

Shows how to tune Qwen3.6-35B-A3B on consumer hardware (RTX 5070 Ti, Ryzen 9800X3D) to reach 79 t/s with 128K context. The key finding is that setting `--n-cpu-moe N` correctly in llama.cpp significantly outperforms the commonly used `--cpu-moe`, because it lets more of the MoE experts use GPU VRAM.

42
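
The difference between the two flags can be pictured with a small sketch. It assumes the commonly documented behavior that `--cpu-moe` keeps all MoE expert weights on the CPU, while `--n-cpu-moe N` keeps only the expert weights of the first N layers there. The function `expert_placement` and the layer count in the usage line are hypothetical and only model the placement decision, not llama.cpp internals.

```python
from typing import List


def expert_placement(n_layers: int, n_cpu_moe: int) -> List[str]:
    """Return where each layer's MoE expert weights would live, assuming the
    first n_cpu_moe layers keep their experts on the CPU and the rest go to
    the GPU. Values outside [0, n_layers] are clamped."""
    n_cpu = max(0, min(n_cpu_moe, n_layers))
    return ["cpu" if layer < n_cpu else "gpu" for layer in range(n_layers)]


if __name__ == "__main__":
    n_layers = 8  # hypothetical layer count, for illustration only
    # Roughly what --cpu-moe does: every layer's experts stay on the CPU.
    print(expert_placement(n_layers, n_cpu_moe=n_layers))
    # A smaller N leaves more experts on the GPU, if VRAM allows.
    print(expert_placement(n_layers, n_cpu_moe=3))
```

In practice the useful N is the smallest value that still fits in VRAM alongside the KV cache for the chosen context length, so it is usually found by trial.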
NEWS · ↑ trending · Reddit r/LocalLLaMA · 5/4/2026

Llama.cpp MTP support now in beta!

Llama.cpp's MTP support is now in beta, initially supporting Qwen3.5 MTP, with potential for an imminent merge. This enhancement, alongside maturing tensor-parallel support, is expected to close performance gaps with vLLM, particularly in token generation speeds.

42
DOC · ↑ trending · Reddit r/LocalLLaMA · 5/6/2026

Get faster qwen 3.6 27b

Explains how to speed up the Qwen 3.6 27B model with llama.cpp on a 3090 GPU. It covers applying a specific commit and the `llama-server` setup commands used to reach 50 t/s with 100k context.

42
DOC · ↑ trending · Reddit r/LocalLLaMA · 27d ago

llama.cpp docker images to run MTP models

This content describes the creation of Docker images for `llama.cpp` to simplify running MTP models, following numerous improvements and bug fixes. It also notes that Unsloth has released new MTP models for Qwen 3.6, making previous versions obsolete.

41