llama.cpp

33 items

DOC ↑ trending · Reddit r/LocalLLaMA · 4/22/2026

Running Qwen3.6-35B-A3B Locally for Coding Agent: My Setup & Working Config

This content details a setup for running the Qwen3.6-35B-A3B model locally on a MacBook Pro M2 Max. It describes the integration with the `pi` coding agent via `llama.cpp` and `llama-server`, including configuration parameters and command line setup.

Coding Agent llama.cpp Local AI macOS

ARTICLE ↑ trending · Reddit r/LocalLLaMA · 4/11/2026

Intel Arc Pro B70 32GB performance on Qwen3.5-27B@Q4

The Intel Arc Pro B70 32GB card achieved ~12 tps for single queries and 135 tps with 32 concurrent requests on Qwen3.5-27B@Q4, about 20% below the RTX PRO 4500. It also drew 50% more power under high concurrency. Tensor parallelism degraded performance, while pipeline parallelism improved it.

Qwen3.5 llama.cpp GPU performance Intel Arc Pro B70
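
To put the two throughput figures above side by side, here is a small arithmetic sketch. It assumes the 135 tps figure is the aggregate across all 32 requests (the summary does not say so explicitly) and treats the approximate "~12 tps" as exactly 12.

```python
# Figures quoted in the summary; the single-stream number is approximate.
single_stream_tps = 12.0   # ~12 tokens/s for one request
aggregate_tps = 135.0      # ASSUMPTION: summed over all concurrent requests
concurrent_requests = 32

# Throughput each individual request would see under load.
per_request_tps = aggregate_tps / concurrent_requests

# Growth in total throughput compared with serving a single stream.
aggregate_speedup = aggregate_tps / single_stream_tps

print(f"per-request under load: {per_request_tps:.2f} tok/s")
print(f"aggregate vs. single stream: {aggregate_speedup:.2f}x")
```

Under that assumption, batching trades per-request speed for total throughput, which is why single-query and concurrent numbers are usually reported separately.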

DOC ↑ trending · Reddit r/LocalLLaMA · 5/6/2026

Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR

This content details the implementation of Multi-Token Prediction (MTP) with quantized GGUFs for Qwen3.6-27B, using Unsloth's UD XL quantizations with Q8_0 MTP layers grafted on top, resulting in a 2.5x throughput increase. The author shares grafted GGUF files, the raw MTP layer source, and a conversion script, along with instructions for a custom llama.cpp build that adds speculative decoding support from an unmerged PR.

Multi-Token Prediction llama.cpp quantization large language models

DOC ↑ trending · Reddit r/LocalLLaMA · 5/6/2026

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints

This content details how to achieve 2.5x faster inference with Qwen 3.6 27B using MTP support in llama.cpp, enabling 28 tok/s on an M2 Max. It provides converted GGUF files for download, suitable for local agentic coding with 262k context on 48GB.

LLM optimization llama.cpp GGUF Qwen

ARTICLE ↑ trending · Reddit r/LocalLLaMA · 5/7/2026

why llama.cpp can’t combine speculative decode methods?

A user asks why speculative decoding methods such as MTP and N-gram cannot be used together in llama.cpp, noting that N-gram offers significant improvements for agentic coding. They want to know whether the limitation is fundamental or just a matter of implementation, and find that others have already asked the same question.

Optimization LLMs llama.cpp Qwen3.6
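
For readers unfamiliar with N-gram drafting, here is a minimal, generic sketch of the idea: look for an earlier occurrence of the most recent tokens in the context and propose whatever followed it as the draft. This is an illustration only, not llama.cpp's implementation; the function name `ngram_draft` and its parameters are hypothetical.

```python
from typing import List, Sequence


def ngram_draft(tokens: Sequence[int], n: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by matching the last n tokens against earlier context.

    Hypothetical helper for illustration; real implementations differ.
    """
    if len(tokens) <= n:
        return []
    suffix = list(tokens[-n:])
    # Scan backwards so the most recent earlier match wins; the range
    # starts one position before the suffix itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[start:start + n]) == suffix:
            follow = start + n
            # Tokens that followed the earlier occurrence become the draft.
            return list(tokens[follow:follow + max_draft])
    return []


# Example: the context repeats "1 2 3", so the tokens after the first
# occurrence are proposed as the continuation.
draft = ngram_draft([1, 2, 3, 4, 5, 1, 2, 3], n=3, max_draft=2)
```

Drafts like this cost no extra model evaluation, which is why they tend to help in agentic coding, where the model often re-emits text already present in the context.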

ARTICLE ↑ trending · Reddit r/LocalLLaMA · 4/15/2026

Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU +GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload

This article details a novel dynamic expert caching strategy implemented in llama.cpp to accelerate token generation for large MoE models like Qwen3.5-122B-A10B. The approach loads frequently routed experts into VRAM, leading to up to 26.8% faster token generation compared to layer-based partial offload.

Token Generation llama.cpp VRAM Optimization MoE
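
The caching idea can be pictured with a toy frequency-based model: count how often each expert is routed and keep the hottest ones resident. This is a simplified sketch under my own assumptions, not the post's implementation; the class `HotExpertCache` and its eviction policy are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


class HotExpertCache:
    """Toy model: keep the `capacity` most frequently routed experts 'in VRAM'."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.hits: Counter = Counter()
        self.resident: Set[int] = set()

    def record(self, routed_experts: Iterable[int]) -> int:
        """Record one token's routing; return how many experts were already resident."""
        routed = list(routed_experts)
        served_from_vram = sum(1 for e in routed if e in self.resident)
        self.hits.update(routed)
        # ASSUMPTION: re-select the hottest experts after every token. A real
        # system would batch updates to limit host-to-device transfers.
        self.resident = {e for e, _ in self.hits.most_common(self.capacity)}
        return served_from_vram


cache = HotExpertCache(capacity=2)
first = cache.record([0, 1])   # cold cache, nothing resident yet
second = cache.record([0, 2])  # expert 0 is now resident
third = cache.record([2, 0])   # expert 0 hits; expert 2 becomes hot afterwards
```

The benefit depends on routing being skewed: if a small set of experts handles most tokens, most lookups are served from fast memory.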

NEWS ↑ trending · Reddit r/LocalLLaMA · 5/7/2026

feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp

The content announces the addition of MiMo v2.5 model support in llama.cpp and describes its architecture. MiMo v2.5 is a sparse MoE model with 310B total and 15B activated parameters, supporting text, image, video, and audio modalities with a long context length.

multimodal AI Model Architecture llama.cpp MoE

NEWS ↑ trending · Reddit r/LocalLLaMA · 4/9/2026

Gemma 4 on Llama.cpp should be stable now

Fixes merged into llama.cpp have resolved the known Gemma 4 issues, making it stable to use. The content offers tips for running it, such as using `--chat-template-file` and optimizing the cache, and warns against using CUDA 13.2.

Technical Tips Gemma 4 llama.cpp performance

ARTICLE ↑ trending · Reddit r/LocalLLaMA · 4/12/2026

Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

Speculative decoding tests using Gemma 4 E2B as the draft model for Gemma 4 31B showed an average speedup of 29%, rising to 50% on code generation, on the specific hardware and software configuration tested.

Gemma 4 31B llama.cpp benchmark AI performance

ARTICLE ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part.

The content details how to optimize Qwen3.6-35B-A3B on consumer hardware (RTX 5070 Ti, Ryzen 9800X3D), achieving 79 t/s with 128K context. The key finding is the correct use of the `--n-cpu-moe N` flag in llama.cpp, which significantly outperforms the common `--cpu-moe` by utilizing more GPU VRAM for MoE experts.

llama.cpp AI optimization MoE LLM performance
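
As a rough illustration of the difference between the two flags, the sketch below builds a `llama-server` argument list either way. The model path, the context size, the `-ngl 99` value, and the `--n-cpu-moe` count are placeholders of mine, not the post's configuration; the right N depends on how much VRAM is free.

```python
from typing import List, Optional


def llama_server_args(model_path: str, ctx_size: int, n_cpu_moe: Optional[int]) -> List[str]:
    """Build a llama-server command line (hypothetical helper, placeholder values)."""
    # -ngl 99 asks for all layers on the GPU; the MoE flags below then
    # move expert weights back to the CPU.
    args = ["llama-server", "-m", model_path, "-c", str(ctx_size), "-ngl", "99"]
    if n_cpu_moe is None:
        # Keep the MoE expert weights of every layer on the CPU.
        args.append("--cpu-moe")
    else:
        # Keep only the first N layers' expert weights on the CPU, so the
        # remaining experts can use spare VRAM.
        args += ["--n-cpu-moe", str(n_cpu_moe)]
    return args


# Placeholder values: a 128K context and N = 20.
partial = llama_server_args("model.gguf", 131072, n_cpu_moe=20)
all_cpu = llama_server_args("model.gguf", 131072, n_cpu_moe=None)
print(" ".join(partial))
```

A common approach is to lower N step by step until the model no longer fits in VRAM, then back off by one or two.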

NEWS ↑ trending · Reddit r/LocalLLaMA · 4/19/2026

llama.cpp speculative checkpointing was merged

Speculative checkpointing has been merged into llama.cpp and may speed up certain prompts. Some workloads, such as coding with tuned parameters, see anywhere from 0% to 50% improvement, while others gain nothing because their draft acceptance streaks are too short.

Open Source llama.cpp speculative-checkpointing AI inference

ARTICLE ↑ trending · Reddit r/LocalLLaMA · 18d ago

[llama.cpp] Asymmetric KV q8/q4 cache: current caveats and discussion in GGML repo

This content addresses a problem in llama.cpp with asymmetric KV q8/q4 cache quantization, which can cause processing to fall back to the CPU on CUDA. A GitHub discussion points to a fix that involves compiling with support for the specific KV cache quant combination, offering substantial memory savings with only a 1.3% precision loss.

llama.cpp GPU optimization quantization KV cache
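
To get a feel for the memory savings, the sketch below estimates KV cache size from ggml's block sizes (f16 is 2 bytes per element; `q8_0` stores 32 elements in 34 bytes; `q4_0` stores 32 elements in 18 bytes). The model dimensions are made up for illustration, and I am assuming "q8/q4" means K in `q8_0` and V in `q4_0`.

```python
# Bytes needed to store 32 elements in each ggml type.
BYTES_PER_32 = {"f16": 64, "q8_0": 34, "q4_0": 18}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, type_k: str, type_v: str) -> int:
    """Estimate KV cache size in bytes (ignores padding and other overhead)."""
    elems = n_layers * n_kv_heads * head_dim * n_ctx  # per K or V tensor set
    k_bytes = elems * BYTES_PER_32[type_k] // 32
    v_bytes = elems * BYTES_PER_32[type_v] // 32
    return k_bytes + v_bytes


# HYPOTHETICAL model shape, chosen only to make the arithmetic concrete.
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32768)

baseline = kv_cache_bytes(**shape, type_k="f16", type_v="f16")
asymmetric = kv_cache_bytes(**shape, type_k="q8_0", type_v="q4_0")
ratio = asymmetric / baseline

print(f"f16/f16:   {baseline / 2**30:.2f} GiB")
print(f"q8_0/q4_0: {asymmetric / 2**30:.2f} GiB ({ratio:.1%} of baseline)")
```

The ratio does not depend on the model shape, since it follows directly from the per-block byte counts. In llama.cpp the cache types are selected with `--cache-type-k` and `--cache-type-v`.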

NEWS ↑ trending · Reddit r/LocalLLaMA · 5/4/2026

Llama.cpp MTP support now in beta!

Llama.cpp's MTP support is now in beta, initially supporting Qwen3.5 MTP, with potential for an imminent merge. This enhancement, alongside maturing tensor-parallel support, is expected to close performance gaps with vLLM, particularly in token generation speeds.

AI models Qwen3.5 MTP llama.cpp

DOC ↑ trending · Reddit r/LocalLLaMA · 5/6/2026

Get faster qwen 3.6 27b

The content details how to achieve faster performance with the Qwen 3.6 27B model using llama.cpp on a 3090 GPU. It includes steps to apply a specific commit and `llama-server` setup commands to reach 50 t/s with 100k context.

llama.cpp AI optimization GPU performance GGUF

DOC ↑ trending · Reddit r/LocalLLaMA · 4/15/2026

Qwen3.5-35B running well on RTX4060 Ti 16GB at 60 tok/s

The author shares a successful optimization for running the Qwen3.5-35B-A3B-UD-Q4_K_L model on an RTX 4060 Ti 16GB using llama.cpp, achieving 40-60 tokens/s with 64k context. The post provides the detailed `models.ini` configuration and server start command to replicate this performance.

Hardware Acceleration AI Model Optimization llama.cpp local inference

RESEARCH ↑ trending · Reddit r/LocalLLaMA · 19d ago

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp

The author achieved 110 tok/s with 12GB VRAM using ik_llama.cpp on the Qwen3.6 35B A3B model, noting a significant speed boost. This performance surpassed that of regular llama.cpp after its MTP PR merge.

GPU VRAM LLM optimization llama.cpp Benchmarking

ARTICLE ↑ trending · Reddit r/LocalLLaMA · 4/20/2026

Why doesn't any OSS tool treat llama.cpp as a first class citizen?

This article asks why open-source tools do not treat `llama.cpp` as a first-class citizen, and discusses how well it is integrated and recognized across the OSS ecosystem.

Open Source llama.cpp AI tools

DOC ↑ trending · Reddit r/LocalLLaMA · 27d ago

llama.cpp docker images to run MTP models

This content describes the creation of Docker images for `llama.cpp` to simplify running MTP models, following numerous improvements and bug fixes. It also notes that Unsloth has released new MTP models for Qwen 3.6, making previous versions obsolete.

AI models Docker llama.cpp Qwen

NEWS ↑ trending · Reddit r/LocalLLaMA · 4/9/2026

ggml: backend-agnostic tensor parallelism by JohannesGaessler · Pull Request #19378 · ggml-org/llama.cpp

JohannesGaessler's pull request adding backend-agnostic tensor parallelism to the ggml-org/llama.cpp project has been approved by Gerganov. This is an important development for the efficiency and scalability of AI model inference.

llama.cpp tensor parallelism machine learning AI

NEWS ↑ trending · Reddit r/LocalLLaMA · 4/12/2026

Audio processing landed in llama-server with Gemma-4

llama.cpp (`llama-server`) now officially supports Speech-to-Text (STT), integrating the Gemma-4 E2B and E4B models. This update brings audio processing to the popular open-source AI inference platform.

Gemma 4 audio processing llama.cpp llama-server