Qwen3.5

7 items

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/11/2026

Intel Arc Pro B70 32GB performance on Qwen3.5-27B@Q4

The Intel Arc Pro B70 32GB reached about 12 tps on single queries and 135 tps with 32 concurrent requests on Qwen3.5-27B@Q4, roughly 20% below the RTX PRO 4500. It also drew 50% more power under high concurrency. Tensor parallelism degraded performance, while pipeline parallelism improved it.

Qwen3.5 llama.cpp GPU performance Intel Arc Pro B70

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/10/2026

Qwen3.5-122B at 198 tok/s on 2x RTX PRO 6000 Blackwell — Budget build, verified results

The author shares results from optimizing a dual-GPU inference server for LLMs, reaching 198 tok/s with the Qwen3.5-122B NVFP4 model. The post details the hardware configuration (2x RTX PRO 6000 Blackwell) and compares the performance of different inference engines and language models.

Qwen3.5 Benchmarking GPU performance LLM inference

NEWS · ↑ trending · Reddit r/LocalLLaMA · 5/4/2026

Llama.cpp MTP support now in beta!

llama.cpp's MTP (multi-token prediction) support is now in beta, initially covering Qwen3.5 MTP, and may be merged soon. Together with maturing tensor-parallel support, it is expected to narrow the performance gap with vLLM, particularly in token generation speed.

AI models Qwen3.5 MTP llama.cpp

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/8/2026

Qwen3.5-35B-A3B-Uncensored-FernflowerAI-GGUF

The author found and fixed a training bug in the Qwen3.5-35B-A3B model and released a fixed version, an improved system prompt, a chat template with tool-calling support, and recommended settings for LM Studio. The fix addresses the context loss and repetition problems that occurred in long conversations with the previous version of the model.

Model Fix Qwen3.5 GGUF Uncensored

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/15/2026

DFlash Doubles the T/S Gen Speed of Qwen3.5 27B (BF16) on Mac M5 Max

The new DFlash support in oMLX 0.3.5 RC1 has reportedly more than doubled generation speed for the Qwen3.5 27B (BF16) model on a Mac M5 Max, from 9 to 22 T/S. This could make it far more practical to run the model locally at higher-precision quantizations or full weights.

oMLX DFlash Qwen3.5 AI performance

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/10/2026

I no longer need a cloud LLM to do quick web research

The author shares their setup for fast web research and scraping with local LLMs, specifically Qwen3.5:27B-Q3_K_M on an RTX 4090 with llama.cpp. The post details the tools and process that enable effective offline extraction of web content, indicating that local models now meet the author's quality standards.

RTX 4090 Qwen3.5 local LLM llama.cpp

ARTICLE · DEV.to AI · 5/3/2026

BizNode uses Ollama (Qwen3.5) running locally on your hardware — your data never leaves your machine. True AI privacy

BizNode runs Ollama (Qwen3.5) locally on the user's own hardware, so data never leaves the machine. The article presents this as true AI privacy.

Qwen3.5 Ollama privacy security