MoE

21 items

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/22/2026

Forgive my ignorance but how is a 27B model better than 397B?

A user expresses confusion regarding how a 27B dense model could outperform a 397B Mixture-of-Experts (MoE) model, specifically mentioning Qwen, and questions the utility of the additional experts.

AI models Model Architecture MoE Qwen

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/9/2026

Used ray tracing cores on my RTX 5070 Ti for LLM routing — 218x speedup, runs entirely on 1 consumer GPU

A novel method uses GPU RT Cores for expert routing in MoE models, yielding a 218x speedup and 731x less VRAM for that task. The research also finds that MoE experts specialize by syntactic type rather than by topic, as was previously believed.

Hardware Optimization AI MoE Ray Tracing Cores

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/15/2026

Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU +GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload

This article details a novel dynamic expert caching strategy implemented in llama.cpp to accelerate token generation for large MoE models like Qwen3.5-122B-A10B. The approach loads frequently routed experts into VRAM, leading to up to 26.8% faster token generation compared to layer-based partial offload.

Token Generation llama.cpp VRAM Optimization MoE
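
As a rough illustration of the idea summarized above (keeping frequently routed experts in the fast memory tier), here is a minimal frequency-based cache sketch. It is not the llama.cpp implementation; the class name `HotExpertCache`, the `capacity` parameter, and the routing trace are all hypothetical and chosen only for illustration.

```python
from collections import Counter

class HotExpertCache:
    """Keep the most frequently routed experts resident in a fixed number of slots."""

    def __init__(self, capacity):
        self.capacity = capacity      # hypothetical number of expert slots that fit in VRAM
        self.hits = Counter()         # routing count per (layer, expert) pair
        self.resident = set()         # experts currently held in the fast tier

    def access(self, layer, expert):
        """Record one routing decision; return True if the expert was already resident."""
        key = (layer, expert)
        self.hits[key] += 1
        if key in self.resident:
            return True
        if len(self.resident) < self.capacity:
            self.resident.add(key)    # free slot available, load the expert
            return False
        coldest = min(self.resident, key=lambda k: self.hits[k])
        if self.hits[key] > self.hits[coldest]:
            self.resident.remove(coldest)   # evict the least-used resident expert
            self.resident.add(key)
        return False

# Hypothetical routing trace of (layer, expert) pairs for a few tokens.
cache = HotExpertCache(capacity=2)
trace = [(0, 5), (0, 7), (0, 5), (0, 9), (0, 9), (0, 5), (0, 9)]
hit_flags = [cache.access(layer, expert) for layer, expert in trace]
```

In this toy trace, expert 9 displaces expert 7 only once it has been routed to more often, which is the behavior a "hot expert" policy is meant to capture. A real implementation would also have to account for transfer cost and per-layer slot budgets.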

NEWS · ↑ trending · Reddit r/LocalLLaMA · 5/7/2026

feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp

This pull request adds MiMo v2.5 model support to llama.cpp and describes the model's architecture. MiMo v2.5 is a sparse MoE model with 310B total and 15B activated parameters that supports text, image, video, and audio inputs with a long context window.

multimodal AI Model Architecture llama.cpp MoE

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

Accidentally discovered you can teach frozen MoE models new knowledge by just steering their expert routing — no training needed

A novel method teaches frozen MoE models new knowledge by steering their expert routing, with no additional training. Dubbed Adaptive Cognitive Intelligence (ACI), the technique reportedly corrected factual errors in Gemma 4 using only a small configuration file.

model steering LLMs Gemma 4 Knowledge Injection

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

LM Studio CPU thread pool size vs. tk/s with some MoE layers offloaded to CPU

This post analyzes how the CPU thread pool size in LM Studio affects token generation speed (tk/s), focusing on setups where some Mixture of Experts (MoE) layers are offloaded to the CPU.

LLM optimization CPU performance MoE LM Studio

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part.

The content details how to optimize Qwen3.6-35B-A3B on consumer hardware (RTX 5070 Ti, Ryzen 9800X3D), achieving 79 t/s with 128K context. The key finding is the correct use of the `--n-cpu-moe N` flag in llama.cpp, which significantly outperforms the common `--cpu-moe` by utilizing more GPU VRAM for MoE experts.

llama.cpp AI optimization MoE LLM performance

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/9/2026

Marco-Mini (17.3B, 0.86B active) and Marco-Nano (8B, 0.6B active) by Alibaba

Alibaba recently released Marco-Mini and Marco-Nano, instruction-tuned variants of highly sparse multilingual language models based on Mixture-of-Experts (MoE). Marco-Mini, with only 0.86B of its 17.3B parameters active, stands out for outperforming other models with up to 12B activated parameters on performance benchmarks.

AI models LLMs Alibaba Sparse Models
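
To make the sparsity figures in this item concrete, the short sketch below computes the fraction of parameters active per token from the total and active counts given in the title. The helper name `active_fraction` is hypothetical, and the calculation says nothing about quality or speed, only about the ratio.

```python
def active_fraction(total_b, active_b):
    """Share of parameters used per token, given counts in billions."""
    return active_b / total_b

# (total, active) parameter counts in billions, taken from the item title.
models = {"Marco-Mini": (17.3, 0.86), "Marco-Nano": (8.0, 0.6)}
fractions = {name: active_fraction(t, a) for name, (t, a) in models.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
```

By this measure Marco-Mini activates roughly 5% of its weights per token and Marco-Nano 7.5%.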

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

Qwen 3.6 35B A3B Q4_K_M quant evaluation

This post evaluates the Qwen 3.6 35B A3B Q4_K_M quantized MoE model on CPU using benchmarks such as HumanEval, HellaSwag, and BFCL. It reached 22 tokens/sec, with strong commonsense reasoning (74%) and solid results for an MoE model with 3B active parameters.

AI model evaluation Benchmarking quantization MoE

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/22/2026

Dense vs. MoE gap is shrinking fast with the 3.6-27B release

Dense AI models currently outperform MoE overall, but MoE is rapidly catching up, particularly in coding benchmarks. For users with 24GB VRAM and a need for large context windows, MoE is becoming a more appealing option.

AI models LLMs Benchmarks MoE

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/11/2026

If Dense Models are better for Coding, why are Qwen-Coders MoE?

The author questions Qwen's decision to use the Mixture-of-Experts (MoE) architecture for its coding models instead of dense models, which the author considers more accurate. They speculate that the choice is related to inference speed and regret the absence of a 14B successor.

Model Architecture coding AI MoE AI

RESEARCH · arXiv CS.CL · 4/7/2026

Unveiling Language Routing Isolation in Multilingual MoE Models for Interpretable Subnetwork Adaptation

This content explores language routing isolation in multilingual Mixture of Experts (MoE) models, aiming for more interpretable subnetwork adaptation.

Multilingual Models Subnetwork Adaptation MoE AI

ARTICLE · DEV.to AI · 4/16/2026

How to run Qwen3.6-35B-A3B locally — the coding MoE that beats models 10x its active size

Qwen has released Qwen3.6-35B-A3B, a new Mixture-of-Experts model that delivers big-model quality at small-model speed with vision capabilities. It outperforms models 10x its active size on coding benchmarks like SWE-bench and Terminal-Bench, and also excels in science reasoning and frontend generation.

multimodal AI AI Benchmarks coding AI MoE

RESEARCH · DEV.to AI · 4/23/2026

qwen3.6-27b scores 77.2% on SWE-bench. the dense model is winning against MoE.

The Qwen3.6-27B dense model outperformed the Qwen3.6-35B-A3B MoE model on SWE-bench, scoring 77.2% versus 73.4%. This indicates that dense models may be proving more effective for real-world software engineering tasks.

AI models Model Architecture Benchmarks MoE

RESEARCH · arXiv CS.LG · 4/9/2026

TalkLoRA: Communication-Aware Mixture of Low-Rank Adaptation for Large Language Models

TalkLoRA proposes a MoELoRA framework that addresses the routing instability and expert dominance of existing methods by letting experts communicate before routing. A lightweight Conversation Module facilitates this information exchange and produces a more robust routing signal for Large Language Models (LLMs).

LLMs MoE Communication Fine-tuning

RESEARCH · arXiv CS.LG · 20d ago

HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models

HELLoRA proposes a novel method for fine-tuning Mixture-of-Experts (MoE) models by applying Low-Rank Adaptation (LoRA) modules only to the most frequently activated experts at each layer. This technique significantly reduces trainable parameters and improves downstream performance, attributing its success to structured regularization that maintains expert specialization.

LLMs MoE AI Fine-tuning
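
The selection step described in this summary (adapting only the most frequently activated experts at each layer) can be sketched as a simple counting pass over routing decisions. This is an assumption-laden illustration rather than the paper's code: the function name `select_hot_experts`, the `experts_per_layer` knob, and the calibration trace are hypothetical.

```python
from collections import Counter

def select_hot_experts(routing_trace, experts_per_layer):
    """Map each layer to its most frequently activated experts.

    routing_trace: iterable of (layer, expert) pairs gathered from a calibration run.
    experts_per_layer: how many experts per layer would receive an adapter (hypothetical knob).
    """
    counts = {}
    for layer, expert in routing_trace:
        counts.setdefault(layer, Counter())[expert] += 1
    return {
        layer: [expert for expert, _ in counter.most_common(experts_per_layer)]
        for layer, counter in counts.items()
    }

# Hypothetical routing decisions recorded over a handful of tokens.
trace = [(0, 1), (0, 1), (0, 3), (0, 1), (0, 3), (0, 2),
         (1, 0), (1, 4), (1, 4), (1, 4), (1, 0), (1, 2)]
hot = select_hot_experts(trace, experts_per_layer=2)
```

Here `hot` lists, per layer, the experts that would get LoRA modules under this selection rule, while all other experts stay frozen.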

ARTICLE · DEV.to AI · 18d ago

MiniMax M2.7 API Pricing 2026: Free Tier, Setup, and How It Stacks Against DeepSeek and Kimi

MiniMax M2.7 is a 230-billion-parameter Mixture-of-Experts model released in March 2026. Designed for "agentic" workflows, it delivers capabilities approaching those of proprietary competitors while keeping operating costs significantly lower for organizations deploying agent-based systems.

AI models MoE Agentic AI MiniMax M2.7

NEWS · Qwen Blog · 4/28/2025

Qwen3: Think Deeper, Act Faster

Qwen3, the new family of language models, has been released, with the flagship Qwen3-235B-A22B achieving competitive benchmark results. Smaller models such as Qwen3-30B-A3B and Qwen3-4B also outperformed other models.

AI models Benchmarks MoE Qwen3

ARTICLE · Qwen Blog · 1/28/2025

Qwen2.5-Max: Exploring the Intelligence of Large-scale MoE Model

This post explores the importance of continuously scaling data and model size (dense or Mixture-of-Experts) for improving model intelligence, noting the community's limited experience in this area. It mentions that critical scaling details were recently disclosed with DeepSeek V3 and that Qwen2.5-Max is under development.

AI language models MoE

ARTICLE · Qwen Blog · 1/20/2025

Global-batch load balance almost free lunch to improve your MoE LLM training

This post introduces the Mixture-of-Experts (MoE) architecture as a popular technique for scaling model parameters. It describes the MoE layer as a router plus a group of experts, of which only a subset is activated to process a given input.

deep learning Training MoE Neural Architecture
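
The router-plus-experts structure described in this summary can be sketched in a few lines. This is a toy illustration under stated assumptions, not the blog's training code: the function names `route_top_k` and `moe_forward`, the scalar "experts", and the router scores are hypothetical, and real experts are feed-forward blocks operating on vectors.

```python
import math

def route_top_k(router_logits, k):
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]   # softmax over selected experts only
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    selected = route_top_k(router_logits, k)
    return sum(gate * experts[i](x) for i, gate in selected)

# Four toy "experts"; a real expert is a feed-forward block, not a scalar function.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]   # hypothetical router scores for one token
selected = route_top_k(logits, k=2)
output = moe_forward(3.0, logits, experts, k=2)
```

With `k=2`, only two of the four experts run for this token, which is why an MoE model's per-token compute tracks its active parameter count rather than its total.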