LLM optimization

17 items

DOC · ↑ trending · Reddit r/LocalLLaMA · 5/6/2026

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints

This content details how to achieve 2.5x faster inference with Qwen 3.6 27B using MTP support in llama.cpp, enabling 28 tok/s on an M2 Max. It provides converted GGUF files for download, suitable for local agentic coding with 262k context on 48GB.

LLM optimization · llama.cpp · GGUF · Qwen

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/12/2026

KIV: 1M token context window on a RTX 4070 (12GB VRAM), no retraining, drop-in HuggingFace cache replacement - Works with any model that uses DynamicCache [P]

KIV (K-Indexed V Materialization) is a middleware layer that replaces the standard HuggingFace KV cache with a tiered retrieval system, moving old data to system RAM. This enables 1M token context windows on an RTX 4070 (12GB VRAM) with only 12MB VRAM overhead and good performance.

KIV · LLM optimization · Context window · VRAM

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

LM Studio CPU thread pool size vs. tk/s with some MoE layers offloaded to CPU

This content analyzes the relationship between the CPU thread pool size in LM Studio and token generation speed (tk/s). It specifically focuses on scenarios where some Mixture of Experts (MoE) layers are offloaded to the CPU to optimize performance.

LLM optimization · CPU performance · MoE · LM Studio

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 19d ago

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp

The author achieved 110 tok/s with 12GB VRAM using ik_llama.cpp on the Qwen3.6 35B A3B model, noting a significant speed boost. This performance surpassed that of regular llama.cpp after its MTP PR merge.

GPU VRAM · LLM optimization · llama.cpp · Benchmarking

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 5/6/2026

Uploaded Unsloth Qwen3.6-35B-A3B UD XL models with MTP grafted, here are the results

This post reports results for the Unsloth Qwen3.6-35B-A3B UD XL models with MTP grafted, now available on HuggingFace. Initial tests showed limited speed gains (6% for Q4, 2.5% for Q8) on some setups, though other users reported larger improvements (up to 50%) depending on their hardware.

AI models · LLM optimization · GGUF · performance testing

RESEARCH · arXiv CS.CL · 4/17/2026

Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models

This paper proposes a unified compressed-sensing-guided framework for dynamic LLM execution, addressing the massive parameter counts, memory use, and decoding latency of large language models. It integrates model and prompt compression by using random measurement operators and sparse recovery to estimate task-conditioned and token-adaptive support sets.

Model Compression · LLM optimization · sparse recovery · compressed sensing

ARTICLE · DEV.to AI · 23d ago

How I Cut My LangGraph Agent's Token Costs by 93% with One Import

This article details how to reduce LangGraph agent token costs by 93% by addressing its stateless nature, which leads to redundant computation. The author discovered that over 90% of graph traversal was identical across runs, resulting in paying for work the agent had already done.

LangGraph · LLM optimization · token costs · Cost Efficiency
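
The redundancy described in this card (identical graph work repeated across runs) can be illustrated with plain memoization. The sketch below is not the article's library or the LangGraph API; `state_key`, `cached_node`, and `summarize` are hypothetical names, and it assumes node inputs are JSON-serializable dictionaries and that node functions are deterministic.

```python
import copy
import hashlib
import json
from typing import Callable, Dict

# In-memory store of node outputs, keyed by a hash of (node name, input state).
_cache: Dict[str, dict] = {}


def state_key(node_name: str, state: dict) -> str:
    """Build a stable key; assumes the state is JSON-serializable."""
    payload = json.dumps({"node": node_name, "state": state}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_node(node_name: str, fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
    """Wrap a node function so identical inputs are computed only once."""
    def wrapper(state: dict) -> dict:
        key = state_key(node_name, state)
        if key not in _cache:
            _cache[key] = fn(state)  # the only place the (costly) call happens
        # Return a copy so callers cannot mutate the cached value.
        return copy.deepcopy(_cache[key])
    return wrapper


# Hypothetical node standing in for an LLM call; `calls` records real invocations.
calls = []


def summarize(state: dict) -> dict:
    calls.append(state["text"])
    return {"summary": state["text"][:10]}


cached_summarize = cached_node("summarize", summarize)
first = cached_summarize({"text": "hello world, again"})
second = cached_summarize({"text": "hello world, again"})  # served from cache
```

Whether caching like this is safe depends on the node: anything that reads the clock, external data, or samples with nonzero temperature would need a different key or no caching at all.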

ARTICLE · DEV.to AI · 19d ago

Stop Routing Your Prompts Through Shady AI Proxies: How to Compress LLM Tokens Locally in Node.js

This article warns against using third-party AI proxies for cost optimization, citing serious security risks to proprietary and customer data. It proposes a local solution for LLM token compression within a Node.js runtime, eliminating the need for unverified middlemen.

LLM optimization · data privacy · security · Node.js

RESEARCH · arXiv CS.LG · 4/23/2026

Accelerating PayPal's Commerce Agent with Speculative Decoding: An Empirical Study on EAGLE3 with Fine-Tuned Nemotron Models

This paper evaluates speculative decoding with EAGLE3 as an inference-time optimization for PayPal's Commerce Agent, powered by fine-tuned Nemotron models. The study demonstrates significant performance improvements, including 22-49% throughput increase and 18-33% latency reduction at zero additional hardware cost.

Performance benchmarking · LLM optimization · Inference acceleration · large language models
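
For readers unfamiliar with speculative decoding, the following is a toy sketch of the generic greedy draft-and-verify loop. It is not EAGLE3 and not the paper's implementation; `speculative_step`, `generate`, and the callable-based "models" are hypothetical stand-ins. A real system verifies all drafted positions in one batched forward pass of the target model, which is where the speed-up comes from; here the target is simply called in a loop.

```python
from typing import Callable, List

# A "model" here is just a greedy next-token function over integer token ids.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Draft k tokens, keep those the target agrees with, then add one target token."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The target checks each proposal in order.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Every draft matched, so the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n: int, k: int = 4) -> List[int]:
    """Generate n tokens; the result matches plain greedy decoding with the target."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + n]
```

Because every emitted token is either confirmed or supplied by the target, the output is the same as decoding with the target alone; the draft model only changes how many target steps are needed.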

ARTICLE · DEV.to AI · 4/16/2026

"The Hidden Cost of AI Compute: Why Token Efficiency is Your Competitive Advanta

The article highlights the significant, often overlooked, financial cost of AI compute, especially for large language models like GPT-4, due to token consumption. It argues that most implementations are wasteful, with inefficient prompting and system design leading to unnecessary spending that can be 3-5x higher than required.

AI costs · prompt-engineering · LLM optimization · cloud computing

RESEARCH · DEV.to AI · 20d ago

How Far Can a Small Coding Model Go With a Better Harness?

The article explores the performance of a small coding model (GPT-5.1-Codex-Mini) on Terminal-Bench 2.0, achieving a 61.6% score by optimizing its "harness" rather than swapping for a larger model. It highlights that the model's wrapper plays a crucial role in performance, especially evident when using smaller models where harness mistakes have a greater impact.

model performance · LLM optimization · Benchmarking · code generation

RESEARCH · DEV.to AI · 22d ago

Three Months of Speed-Up Experiments on a 3090 Ti: Autoregressive DFlash MTP for Qwen3.6-27B

This content details a three-month experiment aimed at optimizing the decode performance of the Qwen3.6-27B model on an RTX 3090 Ti GPU. The project reports decoding speed moving from 43 to a range of 39-49 tokens per second, using a new speculative decoding technique (MTP) within llama.cpp.

LLM optimization · llama.cpp · Qwen3.6-27B · GPU performance

RESEARCH · DEV.to AI · 5/9/2026

Adaptive reasoning reduces token usage up to 90% with minimal accuracy loss

Adaptive reasoning formats enable AI models to dynamically decide necessary reasoning steps, slashing token usage by up to 90% with minimal accuracy loss. This method replaces monolithic computation chains with lightweight, dynamically chosen alternatives, overcoming the cost inefficiencies of parallel reasoning.

Visual-language systems · LLM optimization · Token reduction · AI Efficiency

ARTICLE · DEV.to AI · 4/14/2026

I Open-Sourced the Most Overkill Claude Code Setup — 15 Agents, 17 Hooks, 60-99% Token Savings

The author open-sourced an advanced system called "claude-god-mode" to optimize Claude Code usage, addressing high token consumption and poor code quality issues. This system combines multiple optimization layers and 15 specialized agents, resulting in 60-99% token savings and improved generated code quality.

Open Source · LLM optimization · Claude · code generation

ARTICLE · DEV.to AI · 4/24/2026

i burnt $127 in api credits before i fixed these openclaw mistakes

The author recounts burning $127 in API credits due to an AI agent (OpenClaw) looping inefficiently and misusing high-cost models for simple tasks. They fixed this by implementing tiered model configurations, assigning appropriate AI models to specific tasks to optimize performance and cut costs.

LLM optimization · Cost Optimization · AI development · AI agents

ARTICLE · DEV.to AI · 4/10/2026

Most of your Claude Code agents don't need Sonnet

The article presents a 3-tier routing strategy to optimize the cost of Claude Code agent calls, directing each task to the cheapest suitable AI model. The author uses expensive models like Sonnet only for tasks that require deep reasoning, while simpler tasks are assigned to more affordable options such as Haiku and Ollama.

cost management · model routing · LLM optimization · Claude
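
As a rough illustration of the tiered routing idea in this card, the sketch below picks the cheapest tier assumed adequate for a task. The tier labels, the `Task` fields, and the `pick_model` function are hypothetical; the article's actual routing rules and model identifiers are not shown in the summary.

```python
from dataclasses import dataclass

# Illustrative tier-to-model mapping (an assumption, not the article's config).
TIERS = {
    "deep": "sonnet",           # expensive, reserved for deep reasoning
    "standard": "haiku",        # cheaper default
    "trivial": "ollama-local",  # local model for mechanical work
}


@dataclass
class Task:
    description: str
    needs_deep_reasoning: bool = False
    is_mechanical: bool = False  # e.g. formatting, renaming, simple lookups


def pick_model(task: Task) -> str:
    """Return the cheapest tier assumed adequate for the task."""
    if task.needs_deep_reasoning:
        return TIERS["deep"]
    if task.is_mechanical:
        return TIERS["trivial"]
    return TIERS["standard"]


plan = [
    Task("design the migration strategy", needs_deep_reasoning=True),
    Task("rename variables to snake_case", is_mechanical=True),
    Task("write a unit test for the parser"),
]
assignments = [pick_model(t) for t in plan]
```

In practice the hard part is the classification itself; a static flag per agent or task type is the simplest option, and it keeps the routing decision auditable.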

NEWS · DEV.to AI · 4/13/2026

Prompt Quality Score (PQS) Now Supports x402 Payments on Solana

Prompt Quality Score (PQS), a pre-flight quality gate for AI agent workflows, now accepts x402 payments on Base mainnet and Solana. PQS evaluates prompts across 8 dimensions, providing a score and fixes to optimize prompt quality and save on expensive LLM token usage.

LLM optimization · Prompt Quality · Blockchain Payments · Solana