
AI performance


ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/15/2026

Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU+GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload

This article details a novel dynamic expert caching strategy implemented in llama.cpp to accelerate token generation for large MoE models like Qwen3.5-122B-A10B. The approach loads frequently routed experts into VRAM, leading to up to 26.8% faster token generation compared to layer-based partial offload.

43
ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/19/2026

Is anyone getting real coding work done with Qwen3.6-35B-A3B-UD-Q4_K_M on a 32GB Mac in opencode, claude code or similar?

A user is attempting real coding tasks with Qwen3.6-35B on a 32GB M2 MacBook Pro and is running into memory exhaustion and context window management issues. The model identifies the essence of a bug but struggles with the implementation because critical information is lost during context compaction.

39
ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/21/2026

An actual example of "If you don't run it, you don't own it" and Gemma 4 beats both ChatGPT and Gemini Chat

The author shares their experience using various AI models (GPT-OSS 120B, Qwen 3 Max, ChatGPT 4o) to translate a Chinese novel, highlighting challenges with name consistency and unexpected censorship. ChatGPT 4o was initially the best for accuracy and translation quality, but some models showed degradation or filtering over time.

35
ARTICLE · DEV.to AI · 3d ago

<think>

This content outlines requirements for a technical article analyzing AI model performance and pricing, focusing on metrics like TTFT and tokens/sec. It specifies the inclusion of exact pricing and model data, test regions, and code examples for a global API, targeting a backend engineer audience.

30
RESEARCH · arXiv CS.AI · 5/4/2026

Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents

This research challenges the assumption that tool-augmented reasoning always improves LLM performance, showing that it can underperform native chain-of-thought (CoT) reasoning because of a "tool-use tax" imposed by the tool-calling protocol, especially in the presence of semantic noise. The authors propose a Factorized Intervention Framework to analyze this effect and introduce G-STEP as a partial mitigation for protocol-induced errors.

28