Qwen

46 items

ARTICLE↑ trendingReddit r/LocalLLaMA·4/23/2026

POV Qwen 3.5 with thinking

This content discusses the behavior of the AI model Qwen 3.5, which frequently gets stuck in thinking loops. The author makes a brief, informal observation about this characteristic of the model.

thinking loops model behavior AI model Qwen

ARTICLE↑ trendingReddit r/LocalLLaMA·4/22/2026

Forgive my ignorance but how is a 27B model better than 397B?

A user expresses confusion regarding how a 27B dense model could outperform a 397B Mixture-of-Experts (MoE) model, specifically mentioning Qwen, and questions the utility of the additional experts.

AI models Model Architecture MoE Qwen

Forgive my ignorance but how is a 27B model better than 397B?

ARTICLE↑ trendingReddit r/LocalLLaMA·4/17/2026

Qwen 3.6 is the first local model that actually feels worth the effort for me

The author finds Qwen 3.6 to be the first local model genuinely worth the effort, unlike previous experiences with models that were either too weak or required excessive tweaking. Running on a 5090 + 4090 setup, the Q8 model provides 260k context and 170 tokens/second, proving effective for coding tasks like UI XML and embedded C++.

LLMs local models Qwen developer experience

CASE↑ trendingReddit r/LocalLLaMA·4/17/2026

Qwen3.6 is incredible with OpenCode!

The user praises Qwen3.6 OpenCode as an "incredible" local model for complex coding tasks, highlighting its effectiveness in implementing RLS across a multi-language codebase. While not perfect, its ability to iterate on compiler errors makes it a viable alternative to models like Claude Code for daily use.

coding assistant OpenCode AI model review Qwen

CASE↑ trendingReddit r/LocalLLaMA·4/17/2026

Qwen3.6. This is it.

A user recounts their experience with the Qwen3.6 model, which successfully built and tested a tower defense game, demonstrating the ability to identify and fix its own bugs. The AI confirmed builds using screenshots, astonishing the user with its advanced capabilities.

game development code generation AI programming Qwen

RESEARCH↑ trendingReddit r/LocalLLaMA·4/11/2026

DFlash speculative decoding on Apple Silicon : 85 tok/s, 3.3x on Qwen3.5-9B (MLX, M5 Max)

This content describes a native DFlash implementation on MLX for Apple Silicon, significantly accelerating token generation in Qwen models. The speculative decoding technique achieves speedups of up to 3.3x while maintaining identical output quality.

apple-silicon MLX Qwen LLM performance

DOC↑ trendingReddit r/LocalLLaMA·5/6/2026

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints

This content details how to achieve 2.5x faster inference with Qwen 3.6 27B using MTP support in llama.cpp, enabling 28 tok/s on an M2 Max. It provides converted GGUF files for download, suitable for local agentic coding with 262k context on 48GB.

LLM optimization llama.cpp GGUF Qwen

ARTICLE↑ trendingReddit r/LocalLLaMA·4/16/2026

PSA: Qwen3.6 ships with preserve_thinking. Make sure you have it on.

Qwen 3.6 now ships with a new `preserve_thinking` flag that addresses the KV cache invalidation issue by maintaining the model's full reasoning context. This feature is particularly beneficial for agent scenarios, enhancing decision consistency and optimizing token consumption and KV cache utilization.

large language models model optimization Qwen AI agents

PSA: Qwen3.6 ships with preserve_thinking. Make sure you have it on.

NEWS↑ trendingReddit r/LocalLLaMA·4/22/2026

Qwen 3.6 27B is out

The Qwen 3.6 27B model has been released, representing a new addition to large language models. The announcement links to the model's official Hugging Face page for further details.

Qwen model release Large Language Model LLM

ARTICLE↑ trendingReddit r/LocalLLaMA·25d ago

Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version)

The author tested the Qwen 3.6 35b MTP model locally, observing a 1.5x increase in speed. They explored the use of a large context window, reaching 300k tokens with potential for higher.

LLMs Benchmarking Local AI Qwen

DOC↑ trendingReddit r/LocalLLaMA·4/11/2026

Run Qwen3.5-397B-A13B with vLLM and 8xR9700

This document details the optimized execution of the Qwen3.5-397B-A17B-MXFP4 model using vLLM on RDNA4 GPUs, such as 8xR9700. It provides a Dockerfile with Triton patches and instructions for downloading the model and launching the inference container.

Docker GPU MXFP4 Qwen

NEWS↑ trendingReddit r/LocalLLaMA·4/17/2026

Qwen3.6-35B-A3B Uncensored Aggressive is out with K_P quants!

The Qwen3.6-35B-A3B "Aggressive" variant has been released, offering an uncensored version of the original model with no refusals and zero capability loss. This release includes various K_P quants and vision support.

uncensored AI quantization Qwen model release

DOC↑ trendingReddit r/LocalLLaMA·5/6/2026

Get faster qwen 3.6 27b

The content details how to achieve faster performance with the Qwen 3.6 27B model using llama.cpp on a 3090 GPU. It includes steps to apply a specific commit and `llama-server` setup commands to reach 50 t/s with 100k context.

llama.cpp AI optimization GPU performance GGUF

ARTICLE↑ trendingReddit r/LocalLLaMA·4/12/2026

MiniMax-M2.7 vs Qwen3.5-122B-A10B for 96GB VRAM full offload?!

The author compares MiniMax-M2.7 and Qwen3.5-122B-A10B GGUF models for local full offload on a 96GB VRAM rig. For their purposes, Qwen3.5-122B is preferred, despite MiniMax being more quantized, highlighting the trade-offs in performance for local LLM inference.

VRAM GGUF MiniMax Qwen

MiniMax-M2.7 vs Qwen3.5-122B-A10B for 96GB VRAM full offload?!

ARTICLE↑ trendingReddit r/LocalLLaMA·4/17/2026

Qwen 3.6 35B crushes Gemma 4 26B on my tests

The author conducted a personal benchmark where Qwen 3.6 35B significantly outperformed Gemma 4 26B across tests evaluating agentic capabilities, coding, image-to-text synthesis, instruction following, and reasoning. Qwen fixed more issues, showed fewer regressions, and completed the tasks in less time, indicating superior overall performance.

LLM benchmarking Gemma Agentic AI Qwen

CASE↑ trendingReddit r/LocalLLaMA·4/23/2026

Qwen 3.6 is actually useful for vibe-coding, and way cheaper than Claude

The author successfully implemented Qwen 3.6 models (27B and 35B) locally for coding, demonstrating comparable performance to Claude Code. This local setup drastically reduced costs, from an estimated $142 in API calls to less than $4 in electricity over 8 hours.

GPU Claude local inference Cost Savings

Qwen 3.6 is actually useful for vibe-coding, and way cheaper than Claude

RESEARCH↑ trendingReddit r/LocalLLaMA·5/6/2026

Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...)

This content compares the quality of various Qwen 3.6 27B model quantizations using a custom chess game test to find the optimal one for 16 GB VRAM setups. It evaluates the models' ability to track board states and generate accurate SVG images of the chessboard.

VRAM Benchmarking quantization model quality

Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...)

DOC↑ trendingReddit r/LocalLLaMA·4/15/2026

Qwen3.5-35B running well on RTX4060 Ti 16GB at 60 tok/s

The author shares a successful optimization for running the Qwen3.5-35B-A3B-UD-Q4_K_L model on an RTX 4060 Ti 16GB using llama.cpp, achieving 40-60 tokens/s with 64k context. The post provides the detailed `models.ini` configuration and server start command to replicate this performance.

Hardware Acceleration AI Model Optimization llama.cpp local inference

CASE↑ trendingReddit r/LocalLLaMA·4/18/2026

qwen3.6 performance jump is real, just make sure you have it properly configured

A user reports that Qwen 3.6 demonstrates a significant performance leap, proving capable for workloads typically handled by Opus and Codex, though not yet at their level. The user highlights its usefulness and speed when properly configured with `preserve_thinking` on an M5 Max with specific settings.

LLMs AI hardware local inference AI performance

qwen3.6 performance jump is real, just make sure you have it properly configured

DOC↑ trendingReddit r/LocalLLaMA·27d ago

llama.cpp docker images to run MTP models

This content describes the creation of Docker images for `llama.cpp` to simplify running MTP models, following numerous improvements and bug fixes. It also notes that Unsloth has released new MTP models for Qwen 3.6, making previous versions obsolete.

AI models Docker llama.cpp Qwen