inference

28 items

RESEARCHarXiv CS.LG·1d ago

Skip a Layer or Loop It? Learning Program-of-Layers in LLMs

This research proposes "program-of-layers (PoLar)" for LLMs, enabling dynamic skipping or looping of pretrained layers during inference to achieve better or equivalent accuracy with shorter execution paths. A lightweight prediction network learns to generate these customized programs, demonstrating improved performance on mathematical reasoning benchmarks.

neural networks mathematical reasoning inference LLMs

ARTICLE↑ trendingReddit r/MachineLearning·4/23/2026

Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]

The user is optimizing a Transformer model for size and inference speed, having plateaued after FP16 conversion and ONNX optimization, with pruning yielding limited gains. They are seeking advice on advanced techniques like low-rank factorization, aggressive quantization (INT8/INT4), knowledge distillation, or hardware-specific optimizations to achieve further real-world improvements.

Pruning inference Transformer quantization

ARTICLE↑ trendingReddit r/MachineLearning·4/22/2026

I built a new category of AI called a Reductive Inference Model (RIM) that answers by elimination instead of generation — AMA [P]

POEM (Process Of Elimination Master) is a novel AI architecture that answers questions by progressively eliminating impossibilities rather than generating possibilities, operating independently of LLMs. It achieves 88% accuracy, is 95.5x faster, and 100x smaller than TinyLlama 1.1B, demonstrating significant computational efficiency.

AI architecture inference Computational Efficiency sustainable AI

ARTICLE↑ trendingHacker News (AI)·11d ago

DeepSeek Slashes AI Costs to Cents

DeepSeek has dramatically reduced the costs of AI inference, bringing them down to mere cents. This development makes AI technology more accessible and economically viable for a wider range of applications.

DeepSeek AI costs inference cost reduction

RESEARCH↑ trendingReddit r/LocalLLaMA·4/16/2026

Qwen 3.6 35B A3B, RTX 5090 32GB, 187t/s, Q5 K S, 120K Context Size, Thinking Mode Off, Temp 0.1

The content details the performance of the Qwen 3.6 35B A3B model, achieving 187 tokens per second on an RTX 5090 32GB GPU. It highlights support for a 120K context size, using Q5 K S quantization and a temperature of 0.1.

inference AI hardware benchmark performance

Qwen 3.6 35B A3B, RTX 5090 32GB, 187t/s, Q5 K S, 120K Context Size, Thinking Mode Off, Temp 0.1

NEWS↑ trendingReddit r/LocalLLaMA·4/27/2026

Skymizer Taiwan Inc. Unveils Breakthrough Architecture Enabling Ultra-Large LLM Inference on a Single Card

Skymizer Taiwan Inc. has unveiled a breakthrough architecture, the HTX301 card, that allows 700B-parameter LLM inference on a single PCIe card with 384GB memory and low power consumption (~240W). This approach offloads decoding to the HTX301 while GPUs handle prefill, enabling ultra-large LLM inference locally without massive GPU VRAM.

inference LLMs AI hardware

CASE↑ trendingReddit r/MachineLearning·4/27/2026

INT8 quantization gives me better accuracy than FP16 ! [D]

A user observed that INT8 quantization in their deep learning model yielded better inference accuracy than FP16, which was unexpected. They are seeking explanations for INT8's superior performance over FP16.

inference ONNX deep learning quantization

NEWSDEV.to AI·4/22/2026

Google Launches AI Chips for Training and Inference

Google has launched a new line of AI chips, codenamed "Triton X," to directly challenge Nvidia's dominance, promising a 40% cost reduction for training tasks and 25% lower inference latency. This move signifies a seismic shift in the AI hardware market, intensifying competition.

inference AI hardware Training Google

RESEARCHarXiv CS.CL·5d ago

Expert-Aware Refusal Steering

This paper extends refusal steering to Mixture-of-Experts (MoE) Large Language Models, finding that steering performance is not hindered by the MoE architecture. It proposes expert-aware refusal steering methods that leverage expert routing patterns, demonstrating that refusal behavior can be effectively steered based on a single expert's output.

MoE models inference refusal steering AI alignment

ARTICLEDEV.to AI·4/15/2026

I Ran 163 Benchmarks Across 10 LLMs So You Don't Have To. Here's What I Found

This article highlights the common practice of teams overpaying for LLM inference due to a lack of proper benchmarking, often picking models based on popularity rather than cost-efficiency. The author, using a tool called CostGuard, ran 163 benchmarks across 15 models, uncovering surprising price differences of up to 200x between models like Gemini 2.5 Flash and GPT-5.

AI models inference Benchmarking Cost Optimization

RESEARCHarXiv CS.LG·4/20/2026

The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference

This research reveals that KV caching in autoregressive transformer inference, under standard FP16 precision, causes a systematic divergence in decoded token sequences due to different floating-point accumulation orders. Across LLaMA-2-7B, Mistral-7B, and Gemma-2-2B, a 100% token divergence rate was observed, with cache-ON often leading to higher accuracy.

AI models inference LLMs numerical precision

ARTICLEDEV.to AI·12d ago

The Inference Layer

Three AI inference infrastructure startups are collectively raising at over $30 billion, showcasing rapid growth in a sector that barely existed 18 months ago. Companies like Baseten, Fireworks AI, and Modal Labs are achieving multi-billion dollar valuations despite recent revenue milestones.

inference startups enterprise computing Valuation

ARTICLEDEV.to AI·5/3/2026

I wrote a custom CUDA inference engine to run Qwen3.5-27B on $130 mining cards

A developer created a custom CUDA inference engine to successfully run the Qwen3.5-27B large language model on low-cost, repurposed mining graphics cards. This innovative approach demonstrates significant hardware optimization, making powerful AI models more accessible on affordable consumer-grade hardware.

CUDA Optimization inference hardware

ARTICLEDEV.to AI·26d ago

{"title": "How I Cut My LLM Inference Costs by 40% While Handling 5x More Reques

This article details how a team significantly reduced their LLM inference costs by 40% while increasing request capacity fivefold. The solution involved rebuilding their architecture with a lightweight proxy layer to normalize requests to an OpenAI-compatible format, allowing flexible use of various high-performance providers.

Optimization inference cost reduction Architecture

DOCDEV.to AI·7d ago

How to Deploy Mistral 7B with vLLM + KServe on a $10/Month DigitalOcean GPU Droplet: Production-Ready Inference at 1/95th Claude Cost

This guide details deploying Mistral 7B with vLLM and KServe on a $10/month DigitalOcean GPU Droplet, enabling production-ready inference at a drastically reduced cost. This solution offers a 95% saving compared to commercial AI APIs, ensuring high concurrency and low latency.

inference deployment learning Cost Optimization

ARTICLEDEV.to AI·4/21/2026

Multi-Model LLM Routing: Why 76% of Your Inference Shouldn't Touch GPT-4

This article advocates for intelligent LLM request routing to optimize production costs and performance. It suggests directing 76% of requests to cheaper, faster models, reserving frontier models like GPT-4 for the 24% of complex tasks that genuinely require them.

inference model routing Cost Optimization AI agents

ARTICLEDEV.to AI·4/24/2026

How to Deploy Llama 3.2 70B with TensorRT-LLM on a $48/Month DigitalOcean GPU Droplet: 3x Faster Inference Than vLLM

This content describes how to deploy Llama 3.2 70B using TensorRT-LLM on a $48/month DigitalOcean GPU droplet, achieving 3x faster inference than vLLM. It highlights significant cost savings and performance improvements for self-hosting production chatbots compared to OpenAI API costs.

inference LLMs self-hosting Performance optimization

RESEARCHDEV.to AI·5/8/2026

Model Showdown Round 2: Adding Gemma, Kimi, and 579 GB of Stubborn Optimism

This article presents "Model Showdown Round 2," introducing new models like Google's Gemma 4 and Moonshot AI's Kimi K2, and re-evaluating previous models with corrected configurations. The updated benchmarks revealed significant changes in the leaderboard, addressing issues like token limits and command interpretation from the initial round.

AI models inference LLMs Benchmarking

RESEARCHarXiv CS.LG·4/9/2026

$S^3$: Stratified Scaling Search for Test-Time in Diffusion Language Models

O trabalho propõe $S^3$ (Stratified Scaling Search), um método de busca guiado por verificador para melhorar a qualidade de geração em modelos de linguagem de difusão durante o tempo de inferência. Ele realoca a computação no processo de denoising, avaliando e reamostrando seletivamente candidatos promissores para favorecer saídas de maior qualidade.

Diffusion Models search algorithms language models inference

RESEARCHarXiv CS.AI·5/7/2026

Parallel Prefix Verification for Speculative Generation

PARSE (PArallel pRefix Speculative Engine) is a new speculative generation framework that accelerates large language model (LLM) inference. It achieves this by parallelizing prefix verification on a semantic level, overcoming existing limitations by evaluating correctness across multiple prefixes in a single forward pass.

inference AI acceleration parallelization Speculative Decoding