AI Model Optimization

2 items

DOC · ↑ trending · Reddit r/LocalLLaMA · 4/15/2026

Qwen3.5-35B running well on RTX 4060 Ti 16GB at 60 tok/s

The author reports running the Qwen3.5-35B-A3B-UD-Q4_K_L model on an RTX 4060 Ti 16GB with llama.cpp at 40-60 tokens/s with a 64k context. The post includes the full `models.ini` configuration and the server start command needed to reproduce this performance.
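As a rough illustration of why this setup is notable, the sketch below estimates the quantized weight footprint of such a model. It is not taken from the post: the 4.5 bits-per-weight figure is an assumed average for a Q4_K-style quantization, the "A3B" suffix is read as roughly 3B active parameters per token, and the KV cache for the 64k context is ignored entirely.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumed values, not figures from the post.
BITS_PER_WEIGHT = 4.5   # assumed average for a Q4_K-style quant
TOTAL_PARAMS = 35e9     # all experts combined
ACTIVE_PARAMS = 3e9     # parameters touched per token (the "A3B" part)
VRAM_GB = 16.0          # RTX 4060 Ti 16GB

total_gb = weight_footprint_gb(TOTAL_PARAMS, BITS_PER_WEIGHT)
active_gb = weight_footprint_gb(ACTIVE_PARAMS, BITS_PER_WEIGHT)

print(f"all weights:    {total_gb:.1f} GB (fits in VRAM: {total_gb <= VRAM_GB})")
print(f"active weights: {active_gb:.1f} GB (fits in VRAM: {active_gb <= VRAM_GB})")
```

Under these assumptions the full weights exceed 16 GB while the per-token active weights are far smaller, which suggests that part of the model has to live outside VRAM. The actual split used by the author is whatever the post's `models.ini` specifies.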

Hardware Acceleration · AI Model Optimization · llama.cpp · local inference

RESEARCH · DEV.to AI · 7d ago

Telemetry data: Running 284B MoE at 0.00 GB Active VRAM

This post shares hardware telemetry from an architectural test of running a frontier-scale model on highly constrained commodity hardware. It describes benchmarking a 284B-parameter Mixture-of-Experts (MoE) architecture and reports 0.00 GB of active GPU VRAM, achieved by decoupling physical weight storage from active local GPU memory allocation.

Hardware Telemetry · DeepSeek-V4-Flash · AI Model Optimization · VRAM Optimization