Inference Optimization

11 items

ARTICLE · trending · Reddit r/LocalLLaMA · 4/19/2026

Unweight: how we compressed an LLM 22% without sacrificing quality

Cloudflare developed Unweight, a lossless compression system that shrinks LLM weights by 15-22% to relieve GPU memory-bandwidth bottlenecks during inference. It applies Huffman coding to the predictable exponent bytes of BF16 weights, so the decompressed weights, and therefore the outputs, remain bit-exact.

GPU optimization · lossless compression · LLM compression · Inference Optimization
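
The idea in the Unweight summary above can be sketched on synthetic data. This is a minimal illustration, not Cloudflare's implementation: it assumes NumPy, Gaussian random weights, BF16 obtained by truncating float32, and a plain textbook Huffman coder applied to the 8-bit exponent field. The helper names are hypothetical, and the cost of storing the code table is ignored.

```python
import heapq
from collections import Counter

import numpy as np


def to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Truncate float32 values to BF16 by keeping the top 16 bits."""
    return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)


def huffman_code(symbols) -> dict:
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:  # a single distinct symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far})
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def compress(bits: np.ndarray):
    """Huffman-code the exponent field; keep sign + mantissa as a raw byte."""
    exponent = ((bits >> 7) & 0xFF).tolist()
    rest = (((bits >> 15) << 7) | (bits & 0x7F)).astype(np.uint8)
    code = huffman_code(exponent)
    stream = "".join(code[e] for e in exponent)
    return code, stream, rest


def decompress(code: dict, stream: str, rest: np.ndarray) -> np.ndarray:
    """Invert compress() and rebuild the original 16-bit patterns."""
    decode = {c: s for s, c in code.items()}
    exps, cur = [], ""
    for bit in stream:
        cur += bit
        if cur in decode:
            exps.append(decode[cur])
            cur = ""
    exps = np.array(exps, dtype=np.uint16)
    r = rest.astype(np.uint16)
    return ((r >> 7) << 15) | (exps << 7) | (r & 0x7F)


rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=20_000)  # synthetic stand-in for real weights
bits = to_bf16_bits(weights)
code, stream, rest = compress(bits)
restored = decompress(code, stream, rest)

original_bits = 16 * bits.size
compressed_bits = len(stream) + 8 * rest.size
saving = 1 - compressed_bits / original_bits
print(f"bit-exact: {np.array_equal(bits, restored)}, saving: {saving:.1%}")
```

The saving printed here describes the synthetic Gaussian weights only and says nothing about the 15-22% reported for real models.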

ARTICLE · DEV.to AI · 4/19/2026

The Rise of Inference Optimization: The Real LLM Infra Trend Shaping 2026

The article argues that inference optimization, rather than model size, is the critical trend shaping LLM infrastructure in 2026. Training is a one-time cost, whereas inference is an ongoing expense that directly affects margins and user experience, which makes efficiency paramount.

quantization · AI infrastructure · Inference Optimization · Cost Efficiency

RESEARCH · arXiv CS.CL · 4/22/2026

Two-dimensional early exit optimisation of LLM inference

This paper introduces a two-dimensional early-exit strategy for LLM classification tasks that coordinates layer-wise and sentence-wise exiting. The method achieves multiplicative computational savings, with speed-ups of 1.4-2.3x over optimal layer-wise early exit on simpler tasks, and applies across various state-of-the-art LLMs.

LLMs · Computational Efficiency · Inference Optimization

RESEARCH · arXiv CS.CL · 7d ago

SENSE: Semantic Embedding Navigation with Soft-gated Evaluation for Retrieval-based Speculative Decoding

This paper proposes SENSE to enhance Retrieval-based Speculative Decoding (RSD) for LLMs. SENSE addresses RSD's rigid lexical dependencies with robust semantic alignment and a soft-gated evaluation module that validates semantic equivalence.

LLMs · NLP · Inference Optimization · Speculative Decoding

RESEARCH · arXiv CS.CL · 4/23/2026

TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference

TTKV is a temporal-tiered KV cache management framework for LLMs, inspired by human memory, that addresses the linear growth of KV cache memory with context length. It partitions the cache into tiers with heterogeneous capacity and precision, assigning more recent KV states to faster, higher-precision tiers.

neural networks · LLMs · memory management · Inference Optimization
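
The tiering described in the TTKV summary above can be illustrated with a toy cache. This is a sketch under stated assumptions, not the paper's method: the tier boundaries, the bit-widths, and the symmetric per-tier quantizer are all hypothetical, and only the precision side of the tiers is modeled (not capacity or memory speed).

```python
import numpy as np

# Hypothetical tiers: (upper bound on token age, bits per value).
# None means the tier keeps values untouched (counted as 16-bit storage).
TIERS = [(64, None), (512, 8), (float("inf"), 4)]


def quantize(x: np.ndarray, bits) -> np.ndarray:
    """Symmetric uniform quantize-then-dequantize to simulate precision loss."""
    if bits is None:
        return x.copy()
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    if scale == 0:
        return x.copy()
    return np.round(x / scale) * scale


def tier_for_age(age) -> int:
    """Index of the first tier whose age bound exceeds the token's age."""
    for idx, (max_age, _) in enumerate(TIERS):
        if age < max_age:
            return idx
    return len(TIERS) - 1


def tiered_cache(kv: np.ndarray):
    """kv has shape [seq_len, dim]; the last row is the most recent token."""
    seq_len = kv.shape[0]
    ages = seq_len - 1 - np.arange(seq_len)
    tiers = np.array([tier_for_age(a) for a in ages])
    out = np.empty_like(kv)
    for idx, (_, bits) in enumerate(TIERS):
        mask = tiers == idx
        if mask.any():
            out[mask] = quantize(kv[mask], bits)
    return out, tiers


def cache_bits(tiers: np.ndarray, dim: int, full_bits: int = 16) -> int:
    """Total storage in bits when each tier uses its own bit-width."""
    per_tier = [full_bits if b is None else b for _, b in TIERS]
    return sum(per_tier[t] * dim for t in tiers)


rng = np.random.default_rng(0)
kv = rng.normal(size=(2048, 128)).astype(np.float32)  # synthetic KV states
approx, tiers = tiered_cache(kv)

recent_err = np.abs(approx[-64:] - kv[-64:]).max()   # newest tier
old_err = np.abs(approx[:1024] - kv[:1024]).mean()   # oldest tier
ratio = cache_bits(tiers, kv.shape[1]) / (16 * kv.size)
print(f"recent error {recent_err:.4f}, old error {old_err:.4f}, size ratio {ratio:.3f}")
```

The newest tokens are stored exactly while older ones absorb the quantization error, which is the trade the summary describes.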

ARTICLE · DEV.to AI · 4/15/2026

The Hidden Cost of Running LLM Applications at Scale

The article examines why LLM production costs often escalate unexpectedly and attributes the problem to early design decisions rather than to the direct cost of the model. One key mistake it identifies is sending every request type to a single expensive inference endpoint without any optimization.

multi-tenant · LLM production systems · LLM costs · AI economics
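
The single-endpoint mistake named in the summary above can be made concrete with a small cost comparison. Everything in this sketch is an assumption for illustration: the endpoint names, the prices, the routing policy, and the request mix are invented and do not come from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    usd_per_1k_tokens: float  # hypothetical price, for illustration only


LARGE = Endpoint("large-model", 0.0100)
SMALL = Endpoint("small-model", 0.0005)

# Hypothetical policy: simple request types go to the cheaper endpoint.
ROUTES = {
    "classification": SMALL,
    "extraction": SMALL,
    "reasoning": LARGE,
}


def route(request_type: str) -> Endpoint:
    """Unknown request types fall back to the large endpoint."""
    return ROUTES.get(request_type, LARGE)


def total_cost(requests, router) -> float:
    """Sum the cost of (request_type, token_count) pairs under a routing function."""
    return sum(router(kind).usd_per_1k_tokens * tokens / 1000 for kind, tokens in requests)


# Made-up traffic mix: mostly short, simple requests.
requests = (
    [("classification", 300)] * 800
    + [("extraction", 500)] * 150
    + [("reasoning", 2000)] * 50
)

single = total_cost(requests, lambda kind: LARGE)  # everything on one endpoint
routed = total_cost(requests, route)
print(f"single endpoint: ${single:.2f}, routed: ${routed:.2f}")
```

With this invented mix, most of the spend under the single-endpoint design goes to requests that a cheaper endpoint could serve.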

ARTICLE · DEV.to AI · 4/26/2026

DeepSeek V4: Million-Token Context That Actually Works

DeepSeek V4 offers a usable 1-million-token context: its hybrid attention architecture compresses the KV cache by nearly 9x, which addresses the GPU memory cost of long contexts. The article presents it as a practical option for long-context inference, in contrast to many other models.

DeepSeek · AI models · Model Architecture · large language models
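
To see why the KV cache dominates memory at this context length, the standard sizing formula for attention can be worked through. The model dimensions below are hypothetical and are not DeepSeek V4's; the 9x figure from the summary is applied as a flat divisor, which is a simplification.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Keys and values (factor 2) stored for every layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical dense-attention model with BF16 cache entries (2 bytes each).
baseline = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128, seq_len=1_000_000)
compressed = baseline / 9  # flat 9x reduction, per the summary's figure

print(f"baseline: {baseline / 1e9:.1f} GB, after 9x compression: {compressed / 1e9:.1f} GB")
```

Under these assumed dimensions a single million-token sequence needs about 246 GB of cache uncompressed, and about 27 GB after a 9x reduction.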

RESEARCH · arXiv CS.CL · 5/1/2026

Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling

This paper introduces the Length Value Model (LenVM), a novel token-level framework for modeling the remaining generation length in autoregressive models. By formulating length modeling as a value estimation problem, LenVM provides an annotation-free, scalable, and effective signal for LLMs and VLMs, improving performance on exact length matching tasks.

deep learning · Model Architecture · computer vision · large language models

RESEARCH · arXiv CS.CL · 4/30/2026

SpecTr-GBV: Multi-Draft Block Verification Accelerating Speculative Decoding

SpecTr-GBV is a novel speculative decoding method that unifies multi-draft and greedy block verification to accelerate language model inference. It formulates the verification step as an optimal transport problem, improving both theoretical efficiency and empirical performance by achieving the optimal expected acceptance length.

large language models · Inference Optimization · Speculative Decoding · AI Research
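
For context on the verification step mentioned above, here is the standard single-draft, token-level acceptance rule from speculative sampling, which is the baseline that multi-draft and block-level schemes build on. It is not SpecTr-GBV's optimal-transport formulation. The toy distributions are made up, and NumPy is assumed.

```python
import numpy as np


def speculative_step(p: np.ndarray, q: np.ndarray, rng) -> tuple:
    """Draw one draft token from q and verify it against the target p.

    Returns (token, accepted). The draft token x is accepted with probability
    min(1, p[x] / q[x]); on rejection a token is resampled from the normalized
    residual max(p - q, 0), so the output is distributed exactly as p.
    """
    x = int(rng.choice(len(q), p=q))
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False


rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])  # toy target-model distribution
q = np.array([0.2, 0.5, 0.3])  # toy draft-model distribution

samples = [speculative_step(p, q, rng) for _ in range(50_000)]
tokens = np.array([t for t, _ in samples])
empirical = np.bincount(tokens, minlength=len(p)) / len(tokens)
accept_rate = float(np.mean([a for _, a in samples]))

print("empirical output distribution:", np.round(empirical, 3))
print("acceptance rate:", round(accept_rate, 3))
```

The expected acceptance rate for this rule is the sum of min(p, q) over tokens, which is 0.7 for these toy distributions. Raising that acceptance (or the accepted length per verification) is what methods in this family aim for.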

RESEARCH · arXiv CS.CL · 4/24/2026

TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping

This paper introduces TRACES, a lightweight framework that optimizes Large Reasoning Models (LRMs) by tagging reasoning steps in real time. It enables adaptive, cost-efficient early stopping of LRM inference, addressing the models' current inefficiency and their over-generation of verification steps.

LLMs · early stopping · Reasoning · Inference Optimization

RESEARCH · arXiv CS.CL · 4/21/2026

Cross-Family Speculative Decoding for Polish Language Models on Apple Silicon: An Empirical Evaluation of Bielik 11B with UAG-Extended MLX-LM

This research evaluates cross-family speculative decoding for Polish LLMs on Apple Silicon, extending the MLX-LM framework with Universal Assisted Generation (UAG) for cross-tokenizer compatibility. Experiments show that context-aware token translation significantly improves acceptance rates for Bielik 11B on Polish language datasets.

apple-silicon · natural language processing · Inference Optimization · Speculative Decoding
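
The cross-tokenizer problem in the summary above can be shown with two toy tokenizers. This sketch assumes that translation works by decoding the draft model's tokens to text and re-encoding that text with the target model's tokenizer. The tokenizer class, both vocabularies, and the `translate` helper are hypothetical, and the context-aware refinement evaluated in the paper is not modeled.

```python
class ToyTokenizer:
    """Greedy longest-match tokenizer over a fixed vocabulary (illustrative only)."""

    def __init__(self, vocab):
        self.vocab = list(vocab)
        self.ids = {tok: i for i, tok in enumerate(self.vocab)}
        self.max_len = max(len(t) for t in self.vocab)

    def encode(self, text: str) -> list:
        out, pos = [], 0
        while pos < len(text):
            # Try the longest candidate piece first, then shorter ones.
            for size in range(min(self.max_len, len(text) - pos), 0, -1):
                piece = text[pos:pos + size]
                if piece in self.ids:
                    out.append(self.ids[piece])
                    pos += size
                    break
            else:
                raise ValueError(f"cannot tokenize {text[pos:]!r}")
        return out

    def decode(self, ids) -> str:
        return "".join(self.vocab[i] for i in ids)


# Two made-up vocabularies that split the same Polish text differently.
draft_tok = ToyTokenizer(["dz", "ie", "ń", " ", "do", "br", "y"])
target_tok = ToyTokenizer(["dzień", " dobry", "d", "z", "i", "e", "ń", " ", "o", "b", "r", "y"])


def translate(draft_ids, src: ToyTokenizer, dst: ToyTokenizer) -> list:
    """Map draft-model token ids into the target vocabulary via plain text."""
    return dst.encode(src.decode(draft_ids))


draft_ids = draft_tok.encode("dzień dobry")
target_ids = translate(draft_ids, draft_tok, target_tok)
print(len(draft_ids), "draft tokens ->", len(target_ids), "target tokens")
```

The draft ids are meaningless to the target model until they are translated, and the token counts differ between the two vocabularies, which is why acceptance has to be judged in the target model's token space.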