LLM inference

11 items

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/10/2026

Qwen3.5-122B at 198 tok/s on 2x RTX PRO 6000 Blackwell — Budget build, verified results

The author shares optimization results from a dual-GPU LLM inference server that reaches 198 tok/s with the Qwen3.5-122B NVFP4 model. The post details the hardware configuration (2x RTX PRO 6000 Blackwell) and compares performance across different inference engines and language models.

Qwen3.5 · Benchmarking · GPU performance · LLM inference

DOC · ↑ trending · Reddit r/LocalLLaMA · 4/27/2026

To 16GB VRAM users, plug in your old GPU

This post suggests that users with 16GB of VRAM add an old GPU (6GB+ VRAM) to increase total VRAM, which makes it possible to run larger models (~30B) even when the secondary card is weaker. It includes a practical configuration example for `llama-server`.

deep learning · GPU optimization · LLM inference · VRAM management

CASE · ↑ trending · Reddit r/LocalLLaMA · 4/15/2026

DGX Spark just arrived — planning to run vLLM + local models, looking for advice

A new DGX Spark owner is seeking advice on configuring it for local LLM inference, planning to use vLLM, PyTorch, and Hugging Face models for a private API backend. They are looking for recommendations on efficient models, tuning tips for vLLM on unified memory systems, and real-world throughput insights.

DGX Spark · On-prem AI · LLM inference · PyTorch

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/26/2026

Thoughts on using an AMD Alveo V80 FPGA PCI card as a poor man’s Taalas HC1 (LLM-burned-onto-a-chip).

The author explores using an AMD Alveo V80 FPGA card for LLM inference, aiming to approximate the performance of a dedicated Taalas HC1 chip. Gemini Pro, which the author consulted, estimated potential speeds of 1,400-3,200 tok/s, and the author is seeking community input on this speculative approach.

AI hardware · FPGA · LLM inference

ARTICLE · DEV.to AI · 18d ago

RAM Coffers: NUMA-Aware LLM Inference — Why Hardware Topology Still Matters

The article argues that NUMA memory topology, not just VRAM, is a critical bottleneck for LLM inference on multi-socket servers and can cause significant throughput degradation. RustChain's RAM Coffers addresses this by detecting the NUMA topology and optimizing memory allocation and thread pinning for more predictable performance.

multi-socket servers · NUMA · LLM inference · hardware optimization

ARTICLE · DEV.to AI · 4/16/2026

"The Real Cost of AI Compute: Why Your Agent's Token Budget Is Your Lifeline"

This article highlights the critical and often underestimated financial impact of AI compute, particularly token usage, when deploying AI agents in production. It emphasizes that token budgets, rather than feature roadmaps, define an agent's true operational limits due to direct costs and overheads like RAG.

AI costs · AI deployment · LLM inference · Cost Optimization

DOC · DEV.to AI · 26d ago

Laravel Horizon in Production: Configuring AI Queue Workloads That Actually Hold

This guide addresses the challenges of configuring Laravel Horizon for AI inference workloads in production, where standard queue job defaults fail due to the extended processing times of LLMs. It explains how to prevent silent timeouts and job failures that occur when Horizon's default settings are not adapted for long-running AI tasks.

queue management · production operations · AI deployment · LLM inference

DOC · AWS Machine Learning Blog · 11d ago

Comprehensive observability for Amazon SageMaker AI LLM inference: From GPU utilization to LLM quality

This post showcases a comprehensive observability solution utilizing Amazon Managed Grafana dashboards. It provides a holistic view of both the quality and quantity of LLMs served on Amazon SageMaker AI inference endpoints.

Grafana · AI Monitoring · LLM inference · observability

ARTICLE · DEV.to AI · 4/8/2026

99.8% of LLM Inference Power Isn't Spent on Computation

The article argues that power consumption, more than bandwidth or VRAM, is the biggest bottleneck in LLM inference because of physical limits. It attributes this to the breakdown of Dennard scaling around 2006, after which shrinking transistors no longer automatically reduced power consumption.

power consumption · Bandwidth · AI hardware · VRAM

RESEARCH · arXiv CS.LG · 4/6/2026

Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers

This study characterizes WebGPU dispatch overhead for LLM inference across a range of GPU platforms, backends, and browsers. It shows that simple benchmarks overestimate the cost and identifies the true per-dispatch cost of the WebGPU API, stressing that this distinction is needed for effective optimization.

neural networks · Optimization · browsers · Overhead

NEWS · DEV.to AI · 4/15/2026

AWS Speed Boosts, Agentic Limits, and Clinical AI Advances

AWS is optimizing LLM inference with speculative decoding on Trainium and vLLM, and the Spring AI SDK for Bedrock AgentCore is now generally available. New research also explores agentic system failures, CNN uncertainty quantification, and LLMs' role in clinical reasoning.

Clinical AI · AWS · LLM inference · Agentic AI