
VRAM Optimization

3 items

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/15/2026

Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU+GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload

This article details a novel dynamic expert caching strategy implemented in llama.cpp to accelerate token generation for large MoE models like Qwen3.5-122B-A10B. The approach loads frequently routed experts into VRAM, leading to up to 26.8% faster token generation compared to layer-based partial offload.

43
RESEARCH · DEV.to AI · 7d ago

Telemetry data: Running 284B MoE at 0.00 GB Active VRAM

This post shares hardware telemetry from an architectural test that evaluates frontier-scale model execution on highly constrained commodity hardware. It benchmarks a 284B-parameter Mixture-of-Experts (MoE) architecture and reports 0.00 GB of active GPU VRAM, achieved by decoupling physical weight storage from active local graphics allocation.

27