AI inference

28 items

ARTICLE · ↑ trending · Hacker News (AI) · 6d ago

Lean Inference: Lean Manufacturing Principles Applied to AI

This article explores the application of Lean Manufacturing principles to AI inference, aiming to optimize efficiency and reduce waste in artificial intelligence workflows. It details how lean methodologies can be utilized to improve the performance and sustainability of AI systems.

MLOps Optimization Lean Manufacturing efficiency

DOC · ↑ trending · Reddit r/LocalLLaMA · 5/6/2026

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints

This content details how to achieve 2.5x faster inference with Qwen 3.6 27B using MTP support in llama.cpp, enabling 28 tok/s on an M2 Max. It provides converted GGUF files for download, suitable for local agentic coding with 262k context on 48GB.

LLM optimization llama.cpp GGUF Qwen

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 5/7/2026

ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

ParoQuant is a novel technique that employs pairwise rotation quantization to significantly improve the efficiency of Large Language Model (LLM) inference. This method specifically targets reasoning LLMs, enabling more cost-effective and faster deployment by reducing computational and memory requirements.

Optimization LLMs efficiency quantization

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/19/2026

llama.cpp speculative checkpointing was merged

Speculative checkpointing has been merged into llama.cpp, potentially offering speedups for certain prompts. Some prompts, such as coding prompts run with tuned parameters, can see a 0-50% speed improvement, while others may not benefit because of low draft acceptance streaks.

Open Source llama.cpp speculative-checkpointing AI inference

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 28d ago

I got a real transformer language model running locally on a stock Game Boy Color!

A transformer language model (TinyStories-260K) was successfully run locally on a stock Game Boy Color, utilizing INT8 weights and fixed-point math. This impressive technical feat involved a custom ROM and on-device tokenization, though performance is extremely slow and output is gibberish.

Hardware Acceleration Edge AI quantization AI inference

ARTICLE · DEV.to AI · 15d ago

The Quiet AI War Inside Your Browser

Google launched the Prompt API in Chrome 148, enabling local AI inference with Gemini Nano directly on users' devices, despite strong opposition from Mozilla, Apple, and the W3C. This feature provides AI without server costs, latency, or data leaving the device, cementing Google's win in this

Google Chrome Web Standards Gemini Nano AI inference

DOC · DEV.to AI · 22d ago

Building llama.cpp from source on a Dell Precision T5820 with an RTX 3090 Ti (after seven power cycles)

This post details setting up a Dell Precision T5820 with an RTX 3090 Ti for AI inference using llama.cpp to run Qwen3.6-27B. The author shares the build recipe, PCIe troubleshooting, and long-context tricks, highlighting patience as a crucial fix.

Homelab GPU Troubleshooting llama.cpp

NEWS · AWS Machine Learning Blog · 5d ago

NVIDIA Nemotron 3 Ultra now available on Amazon SageMaker JumpStart

NVIDIA Nemotron 3 Ultra is now available on Amazon SageMaker JumpStart. This deployment offers 5x faster inference and 30% lower cost for AI workloads.

Nemotron 3 Ultra machine learning NVIDIA AI inference

DOC · DEV.to AI · 23d ago

How to Fast Ai Inference with itapi.ai: A Complete Guide [May 2026]

This guide details how itapi.ai simplifies fast AI inference, offering a robust, developer-friendly API that reduces integration time. It provides a step-by-step process for getting started, including creating a free account and installing the official SDK.

development tutorial API SDK

DOC · DEV.to AI · 24d ago

A Developer's Guide to AI Inference Costs in 2026

This practical guide assists developers in estimating AI inference costs, addressing factors like API token cost and the crucial cache-hit rate. For self-hosted models, it emphasizes the importance of GPU utilization rates to optimize expenses. Understanding these variables is essential for financial sustainability in AI feature development.

Optimization cloud computing costs AI inference
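
The variables named in the cost guide above (API token price, cache-hit rate, GPU utilization) can be combined into a rough estimate. The following is a minimal sketch and is not taken from the guide: the function names, the assumption that cached input tokens are billed at a discounted rate, and the assumption that self-hosted throughput scales linearly with utilization are all illustrative.

```python
def api_cost(requests, input_tokens, output_tokens,
             price_in_per_m, price_out_per_m,
             cache_hit_rate=0.0, cached_discount=0.0):
    """Estimate API spend, in the same currency as the per-million-token prices.

    Assumption: a fraction `cache_hit_rate` of input tokens is served from a
    prompt cache and billed at (1 - cached_discount) of the normal input price.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    if not 0.0 <= cached_discount <= 1.0:
        raise ValueError("cached_discount must be between 0 and 1")
    # Share of the full input price that is actually paid after cache discounts.
    input_factor = 1.0 - cache_hit_rate * cached_discount
    per_request = (input_tokens * price_in_per_m * input_factor
                   + output_tokens * price_out_per_m) / 1_000_000
    return requests * per_request


def self_hosted_cost_per_million(gpu_hourly_cost, peak_tokens_per_sec, utilization):
    """Estimate the cost of one million tokens on a self-hosted GPU.

    Assumption: useful throughput is peak throughput times utilization, so an
    idle GPU still costs money but produces no tokens.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return gpu_hourly_cost / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical prices and volumes, for illustration only.
    print(api_cost(100_000, 2_000, 500, 3.0, 15.0,
                   cache_hit_rate=0.6, cached_discount=0.9))
    print(self_hosted_cost_per_million(2.5, 1_500, utilization=0.4))
```

Under these assumptions, halving utilization doubles the self-hosted cost per token, which is why the utilization rate tends to dominate the self-hosting calculation.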

DOC · Hugging Face Blog · 29d ago

Building Blocks for Foundation Model Training and Inference on AWS

The content discusses the essential building blocks for training and inference of foundation models on the AWS platform. It explores the necessary components for implementing and operating these models.

AI training machine learning Foundation Models AWS

ARTICLE · DEV.to AI · 7d ago

Request-Based vs Token Pricing for LLM Inference in 2026

This article discusses how LLM inference pricing is evolving in 2026, with a shift from token-based to request-based billing. Token-based pricing becomes unpredictable with large context windows and agentic workflows, whereas a flat fee per API call offers cost certainty.

cost management LLM pricing AI inference API billing
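
To make the pricing comparison above concrete, here is a small sketch of why token billing becomes harder to predict in an agentic loop. It assumes, purely for illustration, that the context grows by a fixed number of tokens at every step and is resent in full each time; the function names and all parameters are hypothetical, not from the article.

```python
def token_billed_cost(steps, base_context, growth_per_step, output_per_step,
                      price_in_per_m, price_out_per_m):
    """Total cost of an agent run under per-token billing.

    Assumption: each step resends the whole context, which grows linearly.
    """
    total = 0.0
    for step in range(steps):
        context = base_context + step * growth_per_step
        total += (context * price_in_per_m
                  + output_per_step * price_out_per_m) / 1_000_000
    return total


def request_billed_cost(steps, fee_per_request):
    """Total cost of the same run under a flat fee per API call."""
    return steps * fee_per_request


if __name__ == "__main__":
    # Hypothetical prices: compare a short run with a long one.
    for steps in (5, 50):
        by_token = token_billed_cost(steps, 4_000, 2_000, 300, 3.0, 15.0)
        by_request = request_billed_cost(steps, 0.04)
        print(steps, round(by_token, 4), round(by_request, 4))
```

In this model the token-billed total grows roughly quadratically with the number of steps, while the request-billed total grows linearly, which is the cost-certainty argument in miniature.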

ARTICLE · DEV.to AI · 4/19/2026

Cloudflare Workers AI: Run Edge Inference Without a GPU Server

Cloudflare Workers AI enables running AI inference at the edge without needing GPU servers, offering over 50 models and billing per inference unit. This service simplifies AI-native app development by providing global low-latency inference on Cloudflare's GPU network, eliminating cold starts and server management.

cloud computing machine learning Serverless AI inference

DOC · DEV.to AI · 18d ago

Running Flux Schnell (12B) + LLM on an Old AMD RX 580 (8GB) via Native Vulkan: A Complete Architecture Guide [2026]

This technical guide demonstrates running LLMs and Stable Diffusion models on an old AMD RX 580 GPU in 2026, bypassing AI software limitations. It details the use of native Vulkan with the ggml engine for efficient inference, proving the viability of older hardware.

Vulkan hardware ggml AI inference

RESEARCH · arXiv CS.AI · 5/4/2026

Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

TokenArena is introduced as a continuous benchmark that measures AI inference at endpoint granularity across five core axes. It synthesizes output speed, time to first token, price, effective context, and quality, along with energy estimates, into composites like joules and dollars per correct answer and endpoint fidelity.

AI models Energy Efficiency performance evaluation Benchmarking

RESEARCH · arXiv CS.LG · 20d ago

UCCI: Calibrated Uncertainty for Cost-Optimal LLM Cascade Routing

UCCI is an innovative router that uses calibrated uncertainty to optimize the cost of LLM cascades, sending easy queries to smaller models and difficult ones to larger models. It reduces inference cost by 31% on production workloads while maintaining accuracy, by calibrating model confidence.

LLM routing uncertainty calibration model cascades Cost Optimization
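
The general idea of cascade routing described above, sending a query to a cheap model first and escalating only when its confidence is low, can be sketched as follows. This is a generic threshold cascade, not UCCI's actual method: the `route` function, the stub models, the costs, and the fixed threshold are all hypothetical, and the sketch simply assumes the confidence scores are already calibrated.

```python
from typing import Callable, List, Tuple

# A model takes a query and returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def route(query: str, tiers: List[Tuple[Model, float]],
          threshold: float) -> Tuple[str, float, int]:
    """Try tiers from cheapest to most expensive.

    Returns (answer, total cost spent, index of the tier that answered).
    The last tier's answer is always accepted.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    spent = 0.0
    for index, (model, cost) in enumerate(tiers):
        answer, confidence = model(query)
        spent += cost  # every attempted tier is paid for
        is_last = index == len(tiers) - 1
        if confidence >= threshold or is_last:
            return answer, spent, index
    raise AssertionError("unreachable")


def small_model(query: str) -> Tuple[str, float]:
    # Hypothetical stub: confident only on short queries.
    return "small:" + query, 0.9 if len(query) < 20 else 0.3


def large_model(query: str) -> Tuple[str, float]:
    # Hypothetical stub: always confident.
    return "large:" + query, 0.95


if __name__ == "__main__":
    tiers = [(small_model, 1.0), (large_model, 10.0)]
    print(route("2+2", tiers, threshold=0.8))
    print(route("prove this lemma about monotone operators", tiers, threshold=0.8))
```

Note that an escalated query pays for both tiers, so a cascade only saves money when enough queries stop at the cheap tier; that trade-off is what makes the quality of the confidence estimate matter.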

RESEARCH · arXiv CS.CL · 12d ago

EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter Adaptation

EvoSpec introduces a framework for real-time evolution of draft models in speculative decoding for Large Language Models, addressing the bottleneck of large vocabulary sizes. It uses dynamic vocabulary and parameter adaptation, employing a context-aware mechanism and a lightweight online alignment strategy to improve acceptance rates and minimize distributional gaps.

Optimization machine learning large language models AI inference

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/21/2026

llama.cpp is the linux of llm

The content posits that llama.cpp serves a role akin to Linux for Large Language Models, suggesting it's a foundational and open-source platform. It questions whether this analogy accurately describes llama.cpp's significance in the LLM ecosystem.

Open Source AI inference LLM

RESEARCH · arXiv CS.LG · 4/30/2026

Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective

This work rethinks KV cache eviction for LLMs using an information-theoretic objective derived from the Information Bottleneck principle. It introduces CapKV, a new capacity-aware method that preserves information, outperforming existing heuristic strategies.

Memory Optimization machine learning large language models AI inference

ARTICLE · Together AI Blog · 5/8/2026

Serving DeepSeek-V4: why million-token context is an inference systems problem

DeepSeek-V4 makes million-token context a significant inference systems problem. Together AI is exploring the inference work on NVIDIA HGX B200, focusing on solutions like compressed KV layouts and prefix caching for long-context workloads.

long-context models DeepSeek V4 NVIDIA AI inference