POV Qwen 3.5 with thinking
The post is a brief, informal observation that Qwen 3.5, with thinking enabled, frequently gets stuck in thinking loops.

A user expresses confusion regarding how a 27B dense model could outperform a 397B Mixture-of-Experts (MoE) model, specifically mentioning Qwen, and questions the utility of the additional experts.
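
One piece of arithmetic helps frame that question. The sketch below assumes the "A17B" suffix in a name such as Qwen3.5-397B-A17B (which appears later in this digest) means roughly 17B parameters are active per token, which is the usual reading of such suffixes. The function and variable names are illustrative only. It compares compute per token, and it says nothing about which model gives better answers.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of a model's parameters used for each token."""
    return active_b / total_b

# Assumption: "397B-A17B" means ~397B total and ~17B active per token.
moe_total_b, moe_active_b = 397.0, 17.0
dense_b = 27.0  # a dense model uses all of its parameters for every token

moe_fraction = active_fraction(moe_total_b, moe_active_b)
dense_vs_moe_active = dense_b / moe_active_b

print(f"MoE active fraction per token: {moe_fraction:.1%}")
print(f"Dense 27B vs MoE active parameters per token: {dense_vs_moe_active:.2f}x")
```

Under that assumption, the dense model actually spends more parameters on each token than the MoE does, so the total parameter count alone is a poor predictor of per-token compute.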

The author finds Qwen 3.6 to be the first local model genuinely worth the effort, unlike previous experiences with models that were either too weak or required excessive tweaking. Running on a 5090 + 4090 setup, the Q8 model provides 260k context and 170 tokens/second, proving effective for coding tasks like UI XML and embedded C++.
The user praises Qwen3.6 OpenCode as an "incredible" local model for complex coding tasks, highlighting its effectiveness in implementing RLS across a multi-language codebase. While not perfect, its ability to iterate on compiler errors makes it a viable alternative to models like Claude Code for daily use.
A user recounts their experience with the Qwen3.6 model, which successfully built and tested a tower defense game, demonstrating the ability to identify and fix its own bugs. The AI confirmed builds using screenshots, astonishing the user with its advanced capabilities.

The post describes a native DFlash implementation on MLX for Apple Silicon that significantly accelerates token generation in Qwen models. The speculative decoding technique achieves speedups of up to 3.3x while maintaining identical output quality.
The post details how to get 2.5x faster inference with Qwen 3.6 27B using MTP support in llama.cpp, reaching 28 tok/s on an M2 Max. It provides converted GGUF files for download, suitable for local agentic coding with 262k context on 48GB.
Qwen 3.6 now ships with a new `preserve_thinking` flag that addresses the KV cache invalidation issue by maintaining the model's full reasoning context. This feature is particularly beneficial for agent scenarios, enhancing decision consistency and optimizing token consumption and KV cache utilization.
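
The KV cache point can be illustrated with a toy model of prefix reuse. This is a sketch under assumptions: it treats the cache as reusable only for the longest shared token prefix between the previous sequence and the new prompt, and it uses made-up string "tokens". How `preserve_thinking` is actually passed to a server or chat template is not stated in the post and is not shown here.

```python
from typing import List

def reusable_prefix_len(cached: List[str], new_prompt: List[str]) -> int:
    """Count the leading tokens shared by the cached sequence and the new prompt."""
    n = 0
    for a, b in zip(cached, new_prompt):
        if a != b:
            break
        n += 1
    return n

# Toy tokens: what the server cached after the first assistant turn.
cached = ["<user>", "fix", "bug", "<assistant>",
          "<think>", "check", "loop", "</think>", "patched"]
follow_up = ["<user>", "now", "test"]

# History with the reasoning stripped before the next request.
stripped = ["<user>", "fix", "bug", "<assistant>", "patched"] + follow_up
# History with the reasoning kept, as a preserve-thinking mode would send it.
preserved = cached + follow_up

reuse_stripped = reusable_prefix_len(cached, stripped)
reuse_preserved = reusable_prefix_len(cached, preserved)
print(f"reusable tokens, thinking stripped:   {reuse_stripped} of {len(cached)}")
print(f"reusable tokens, thinking preserved:  {reuse_preserved} of {len(cached)}")
```

In this toy, stripping the reasoning makes the new prompt diverge from the cached sequence at the first reasoning token, so everything after that point would need to be recomputed. Keeping the reasoning leaves the whole cached sequence reusable.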

The Qwen 3.6 27B model has been released. The announcement links to the model's official Hugging Face page for further details.
The author tested the Qwen 3.6 35B MTP model locally and observed a 1.5x increase in speed. They also explored a large context window, reaching 300k tokens with room to go higher.
This document details the optimized execution of the Qwen3.5-397B-A17B-MXFP4 model using vLLM on RDNA4 GPUs, such as 8xR9700. It provides a Dockerfile with Triton patches and instructions for downloading the model and launching the inference container.
The Qwen3.6-35B-A3B "Aggressive" variant has been released. It is advertised as an uncensored version of the original model with no refusals and zero capability loss. The release includes various K_P quants and vision support.
The post details how to get higher throughput from the Qwen 3.6 27B model using llama.cpp on a 3090 GPU. It includes steps to apply a specific commit and the `llama-server` setup commands needed to reach 50 t/s with 100k context.
The author compares MiniMax-M2.7 and Qwen3.5-122B-A10B GGUF models for full local offload on a 96GB VRAM rig. For their purposes, Qwen3.5-122B is preferred even though the MiniMax model is more heavily quantized, which illustrates the trade-offs involved in local LLM inference.

The author conducted a personal benchmark where Qwen 3.6 35B significantly outperformed Gemma 4 26B across tests evaluating agentic capabilities, coding, image-to-text synthesis, instruction following, and reasoning. Qwen fixed more issues, showed fewer regressions, and completed the tasks in less time, indicating superior overall performance.
The author ran Qwen 3.6 models (27B and 35B) locally for coding and reports performance comparable to Claude Code. The local setup cut costs drastically, from an estimated $142 in API calls to less than $4 in electricity over 8 hours.
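
The cost figures can be turned into a quick sanity check. The sketch below uses only the numbers quoted in the post plus one labeled assumption: an electricity price of $0.30 per kWh, which the post does not state. It treats "$4" as an upper bound and derives the average power draw that bound would allow.

```python
api_cost_usd = 142.0        # estimated API cost quoted in the post
electricity_cost_usd = 4.0  # upper bound quoted in the post ("less than $4")
hours = 8.0                 # session length quoted in the post

# Assumption: electricity price, not given in the post.
price_usd_per_kwh = 0.30

savings_ratio = api_cost_usd / electricity_cost_usd
max_energy_kwh = electricity_cost_usd / price_usd_per_kwh
max_avg_power_kw = max_energy_kwh / hours

print(f"API cost is at least {savings_ratio:.1f}x the electricity cost")
print(f"$4 buys up to {max_energy_kwh:.1f} kWh at the assumed price")
print(f"That allows an average draw of up to {max_avg_power_kw:.2f} kW over {hours:.0f} h")
```

The comparison covers running costs only. Hardware purchase cost is not part of the post's figures and is left out here as well.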

The post compares the quality of various Qwen 3.6 27B quantizations using a custom chess game test, aiming to find the best one for 16 GB VRAM setups. It evaluates each quant's ability to track board states and generate accurate SVG images of the chessboard.
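
A rough size estimate shows why the choice of quantization matters so much at 16 GB. This is a back-of-the-envelope sketch, not the post's method: the bits-per-weight values are illustrative assumptions (real GGUF quants mix tensor types), "16 GB" is treated as 16 GiB, and the estimate covers weights only, not the KV cache or compute buffers.

```python
def weight_size_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

# Assumption: illustrative average bits per weight, not tied to named quants.
candidate_bpw = [3.5, 4.5, 5.5, 8.5]
vram_gib = 16.0  # assumption: "16 GB" of VRAM treated as 16 GiB

for bpw in candidate_bpw:
    size = weight_size_gib(27.0, bpw)
    headroom = vram_gib - size
    print(f"{bpw:.1f} bpw: {size:5.1f} GiB weights, "
          f"{headroom:+5.1f} GiB left for KV cache and buffers")
```

A negative headroom value means the weights alone would not fit, so some layers would have to stay in system memory.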

The author shares a successful optimization for running the Qwen3.5-35B-A3B-UD-Q4_K_L model on an RTX 4060 Ti 16GB using llama.cpp, achieving 40-60 tokens/s with 64k context. The post provides the detailed `models.ini` configuration and server start command to replicate this performance.
A user reports that Qwen 3.6 demonstrates a significant performance leap, proving capable for workloads typically handled by Opus and Codex, though not yet at their level. The user highlights its usefulness and speed when properly configured with `preserve_thinking` on an M5 Max with specific settings.

The post announces Docker images for `llama.cpp` that simplify running MTP models, following numerous improvements and bug fixes. It also notes that Unsloth has released new MTP models for Qwen 3.6, making previous versions obsolete.
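
Several entries in this digest attribute their speedups to speculative decoding or MTP. Assuming those follow the usual draft-and-verify pattern, the toy sketch below shows why the output can stay identical to normal greedy decoding while needing fewer passes of the large model. It is a generic illustration, not the DFlash or llama.cpp implementation. Both "models" are made-up deterministic functions, and all names are hypothetical.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]

def target_next(ctx: List[int]) -> int:
    """Toy 'large' model: a deterministic rule standing in for greedy decoding."""
    return (sum(ctx) * 31 + len(ctx)) % 50

def draft_next(ctx: List[int]) -> int:
    """Toy 'small' model: agrees with the target except when len(ctx) % 4 == 0."""
    guess = target_next(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess

def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out

def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft k tokens, keep the prefix the target agrees with, fix the first miss."""
    out = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(out) < goal:
        # 1. The draft model proposes k tokens cheaply.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks each position. A real implementation scores all
        #    k positions in one batched forward pass; here that counts as one pass.
        verify_passes += 1
        accepted = proposal
        for i in range(k):
            expected = target(out + proposal[:i])
            if proposal[i] != expected:
                # Keep the agreed prefix and replace the first miss with the target's token.
                accepted = proposal[:i] + [expected]
                break
        out.extend(accepted)
    return out[:goal], verify_passes

prompt = [1, 2, 3]
baseline = greedy_decode(target_next, prompt, 20)
spec_out, passes = speculative_decode(target_next, draft_next, prompt, 20)
print("identical output:", spec_out == baseline)
print("target passes:", passes, "vs 20 for plain greedy decoding")
```

Every token that ends up in the output is one the target model would have chosen itself, which is why quality is unchanged. The speedup depends on how often the draft agrees with the target.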