
MoE

21 items

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/15/2026

Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU +GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload

This article details a novel dynamic expert caching strategy implemented in llama.cpp to accelerate token generation for large MoE models like Qwen3.5-122B-A10B. The approach loads frequently routed experts into VRAM, leading to up to 26.8% faster token generation compared to layer-based partial offload.

43
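
To make the idea of keeping frequently routed experts in VRAM concrete, here is a minimal sketch of a frequency-based cache. It is a toy model under stated assumptions, not the llama.cpp implementation: the class name `HotExpertCache`, the `capacity` parameter, and the admission rule (replace the coldest resident expert once a newcomer has been routed more often) are all hypothetical choices made for illustration.

```python
from collections import Counter


class HotExpertCache:
    """Toy frequency-based expert cache (hypothetical; not llama.cpp code)."""

    def __init__(self, capacity: int):
        self.capacity = capacity   # assumed number of expert slots that fit in VRAM
        self.hits = Counter()      # routing count per (layer, expert)
        self.resident = set()      # experts currently treated as resident in VRAM

    def route(self, layer: int, expert: int) -> bool:
        """Record one routing decision; return True if the expert was already resident."""
        key = (layer, expert)
        self.hits[key] += 1
        was_resident = key in self.resident
        if not was_resident:
            self._maybe_admit(key)
        return was_resident

    def _maybe_admit(self, key) -> None:
        # Free slot available: admit immediately.
        if len(self.resident) < self.capacity:
            self.resident.add(key)
            return
        # Otherwise evict the least-routed resident, but only if the
        # newcomer has strictly more hits (avoids thrashing on ties).
        coldest = min(self.resident, key=lambda k: self.hits[k])
        if self.hits[key] > self.hits[coldest]:
            self.resident.remove(coldest)
            self.resident.add(key)
```

In this sketch a return value of `False` stands in for the slow path (the expert would be computed on the CPU or copied to the GPU first), while `True` stands in for the fast path where the expert weights are already in VRAM.
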
RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

Accidentally discovered you can teach frozen MoE models new knowledge by just steering their expert routing — no training needed

A novel method teaches frozen MoE models new knowledge by steering their expert routing, with no conventional training step. The technique, dubbed Adaptive Cognitive Intelligence (ACI), reportedly corrected factual errors in Gemma 4 using only a small configuration file.

42
ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part.

This post details how to run Qwen3.6-35B-A3B on consumer hardware (RTX 5070 Ti, Ryzen 9800X3D), reaching 79 t/s with 128K context. The key finding is that the `--n-cpu-moe N` flag in llama.cpp, used correctly, significantly outperforms the more common `--cpu-moe` because it keeps more of the MoE experts in VRAM.

42
NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/9/2026

Marco-Mini (17.3B, 0.86B active) and Marco-Nano (8B, 0.6B active) by Alibaba

Alibaba recently released Marco-Mini and Marco-Nano, instruction-tuned variants of highly sparse multilingual language models based on Mixture-of-Experts (MoE). Marco-Mini, with only 0.86B of its 17.3B parameters active, stands out by outperforming other models with up to 12B activated parameters on benchmarks.

42
RESEARCH · arXiv CS.LG · 4/9/2026

TalkLoRA: Communication-Aware Mixture of Low-Rank Adaptation for Large Language Models

TalkLoRA proposes a MoELoRA framework that addresses the routing instability and expert dominance of existing methods by letting experts communicate before routing. It does so through a lightweight conversation module that facilitates information exchange, producing a more robust routing signal for Large Language Models (LLMs).

27
RESEARCH · arXiv CS.LG · 20d ago

HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models

HELLoRA proposes a method for fine-tuning Mixture-of-Experts (MoE) models that applies Low-Rank Adaptation (LoRA) modules only to the most frequently activated experts at each layer. This reduces the number of trainable parameters while improving downstream performance; the authors attribute the gains to structured regularization that preserves expert specialization.

27
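
The selection step described in the HELLoRA summary, picking the most frequently activated experts at each layer, can be sketched as a simple counting pass. This is an illustrative sketch only: the function name `select_hot_experts`, the `(layer, expert)` trace format, and the fixed `top_k` per layer are assumptions, not details taken from the paper.

```python
from collections import Counter


def select_hot_experts(routing_trace, top_k):
    """Return {layer: [expert ids]} with the top_k most-routed experts per layer.

    routing_trace is an iterable of (layer, expert) pairs, one per routing
    decision (a hypothetical trace format chosen for this sketch).
    """
    per_layer = {}
    for layer, expert in routing_trace:
        per_layer.setdefault(layer, Counter())[expert] += 1
    # most_common orders experts by descending activation count.
    return {
        layer: [expert for expert, _ in counts.most_common(top_k)]
        for layer, counts in sorted(per_layer.items())
    }


trace = [(0, 3), (0, 3), (0, 1), (0, 3), (0, 1), (0, 7), (1, 2), (1, 2), (1, 5)]
hot = select_hot_experts(trace, top_k=2)
```

Under this sketch, only the experts listed in `hot` would receive adapter modules during fine-tuning; every other expert would stay frozen.
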
NEWS · Qwen Blog · 4/28/2025

Qwen3: Think Deeper, Act Faster

Qwen3, the new family of language models, has been released, with the flagship model Qwen3-235B-A22B achieving competitive results on benchmarks. Smaller models such as Qwen3-30B-A3B and Qwen3-4B also showed superior performance compared with other models.

23
ARTICLE · Qwen Blog · 1/28/2025

Qwen2.5-Max: Exploring the Intelligence of Large-scale MoE Model

The post explores the importance of continuously scaling data and models (dense or Mixture-of-Experts) to improve model intelligence, noting the community's limited experience in this area. It mentions that critical scaling details were only recently disclosed with DeepSeek V3 and that Qwen2.5-Max is in development.

23