VRAM Optimization

3 items

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/15/2026

Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU+GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload

This article details a novel dynamic expert caching strategy implemented in llama.cpp to accelerate token generation for large MoE models like Qwen3.5-122B-A10B. The approach loads frequently routed experts into VRAM, leading to up to 26.8% faster token generation compared to layer-based partial offload.

Token Generation · llama.cpp · VRAM Optimization · MoE
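The hot-expert idea summarized in this item can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not llama.cpp's implementation: the class `HotExpertCache`, its `capacity` parameter, and the promote-when-hotter eviction rule are all hypothetical and chosen for illustration. It tracks how often each (layer, expert) pair is routed to and treats the most frequently routed ones as resident in VRAM.

```python
from collections import Counter


class HotExpertCache:
    """Toy frequency-based cache (hypothetical, not llama.cpp code).

    Keeps the most-routed experts marked as 'resident in VRAM'.
    """

    def __init__(self, capacity):
        self.capacity = capacity   # assumed number of experts that fit in VRAM
        self.counts = Counter()    # routing count per (layer, expert) key
        self.resident = set()      # keys currently treated as VRAM-resident
        self.hits = 0
        self.misses = 0

    def route(self, key):
        """Record one routing decision; return True if the expert was resident."""
        self.counts[key] += 1
        if key in self.resident:
            self.hits += 1
            return True
        self.misses += 1           # a miss stands in for running the expert on CPU
        self._maybe_promote(key)
        return False

    def _maybe_promote(self, key):
        """Promote a missed expert if there is room or it is hotter than the coldest."""
        if self.capacity <= 0:
            return
        if len(self.resident) < self.capacity:
            self.resident.add(key)
            return
        coldest = min(self.resident, key=lambda k: self.counts[k])
        if self.counts[key] > self.counts[coldest]:
            self.resident.remove(coldest)
            self.resident.add(key)


# Usage: replay a made-up routing trace of (layer, expert) pairs.
cache = HotExpertCache(capacity=2)
for key in [(0, 1), (0, 1), (0, 2), (0, 3), (0, 3), (0, 3), (0, 1)]:
    cache.route(key)
print(cache.hits, cache.misses, sorted(cache.resident))
```

In a real runtime a promotion would involve copying expert weights to the GPU, so the cost of that transfer has to be weighed against the expected number of future hits; the sketch ignores that cost entirely.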

NEWS · ↑ trending · Reddit r/MachineLearning · 4/24/2026

[New Optimizer] 🌹 Rose: low VRAM, easy to use, great results, Apache 2.0 [P]

A new PyTorch optimizer named 'Rose' has been released, promising low VRAM usage, fast convergence, and excellent generalization, licensed under Apache 2.0. Developed over several years, it aims to be easy to use and more memory-efficient than 8-bit AdamW.

deep learning · machine learning · VRAM Optimization · optimizer
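For context on the 8-bit AdamW baseline mentioned in this item, the following back-of-the-envelope sketch estimates optimizer-state memory for Adam-style optimizers, which keep two moment buffers per parameter. The helper `adam_state_bytes` and the 7B parameter count are hypothetical, the estimate ignores the small per-block quantization constants that 8-bit implementations store, and it says nothing about Rose's own footprint, which the summary does not quantify.

```python
def adam_state_bytes(num_params, bytes_per_state_value):
    """Bytes used by the two Adam moment buffers (m and v) for num_params parameters."""
    return 2 * num_params * bytes_per_state_value


num_params = 7_000_000_000  # hypothetical 7B-parameter model

fp32_state = adam_state_bytes(num_params, 4)  # 32-bit moments
int8_state = adam_state_bytes(num_params, 1)  # 8-bit moments, overhead ignored

print(f"fp32 AdamW state:  {fp32_state / 2**30:.1f} GiB")
print(f"8-bit AdamW state: {int8_state / 2**30:.1f} GiB")
```

Under these assumptions the 8-bit state is a quarter of the 32-bit state, so an optimizer claiming to beat it would need to store less than roughly two bytes of state per parameter.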

RESEARCH · DEV.to AI · 7d ago

Telemetry data: Running 284B MoE at 0.00 GB Active VRAM

This post shares hardware telemetry from an architectural test of running a frontier-scale model on heavily constrained commodity hardware. It benchmarks a 284B-parameter Mixture-of-Experts (MoE) architecture and reports 0.00 GB of active GPU VRAM, achieved by decoupling physical weight storage from active local graphics allocation.

Hardware Telemetry · DeepSeek-V4-Flash · AI Model Optimization · VRAM Optimization