vLLM

14 items

ARTICLE·↑ trending·Reddit r/LocalLLaMA·4/11/2026

Intel Arc Pro B70 32GB performance on Qwen3.5-27B@Q4

The Intel Arc Pro B70 32GB card reached ~12 tps on single queries and 135 tps with 32 concurrent requests on Qwen3.5-27B@Q4, about 20% below the RTX PRO 4500. It also drew 50% more power under high concurrency. Tensor parallelism degraded performance, while pipeline parallelism improved it.

Qwen3.5 llama.cpp GPU performance Intel Arc Pro B70
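
The two throughput figures in the Arc Pro B70 summary measure different things, and a little arithmetic makes the gap concrete. The sketch below assumes the 135 tps figure is the aggregate across all 32 concurrent requests (the summary does not say so explicitly); the function names are illustrative only.

```python
# Figures as quoted in the summary (approximate).
SINGLE_STREAM_TPS = 12.0   # one query at a time
AGGREGATE_TPS = 135.0      # ASSUMPTION: total across all concurrent requests
CONCURRENCY = 32


def per_request_tps(aggregate_tps: float, n_requests: int) -> float:
    """Average tokens/s each request sees when n_requests share the card."""
    return aggregate_tps / n_requests


def batching_speedup(aggregate_tps: float, single_tps: float) -> float:
    """How much more total throughput batching yields over one stream."""
    return aggregate_tps / single_tps


if __name__ == "__main__":
    each = per_request_tps(AGGREGATE_TPS, CONCURRENCY)
    gain = batching_speedup(AGGREGATE_TPS, SINGLE_STREAM_TPS)
    print(f"per-request: {each:.2f} tps, aggregate gain: {gain:.2f}x")
```

Under that assumption, each request gets about a third of the single-stream speed, while the card as a whole delivers roughly eleven times the throughput.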

DOC·↑ trending·Reddit r/LocalLLaMA·4/11/2026

Run Qwen3.5-397B-A13B with vLLM and 8xR9700

This document details the optimized execution of the Qwen3.5-397B-A17B-MXFP4 model using vLLM on RDNA4 GPUs, such as 8xR9700. It provides a Dockerfile with Triton patches and instructions for downloading the model and launching the inference container.

Docker GPU MXFP4 Qwen

ARTICLE·↑ trending·Reddit r/LocalLLaMA·4/30/2026

Follow-up: Qwen3.6-27B on 1× RTX 3090 — pushing to ~218K context + ~50–66 TPS, tool calls now stable (PN12 fix)

This update details running Qwen3.6-27B on a single RTX 3090, achieving ~218K context and stable tool calls at 50-66 TPS. A critical memory issue with long tool outputs was resolved by fixing an anchor drift in a Genesis patch (PN12) for vLLM.

Optimization hardware performance vLLM

CASE·↑ trending·Reddit r/LocalLLaMA·4/15/2026

DGX Spark just arrived — planning to run vLLM + local models, looking for advice

A new DGX Spark owner is seeking advice on configuring it for local LLM inference, planning to use vLLM, PyTorch, and Hugging Face models for a private API backend. They are looking for recommendations on efficient models, tuning tips for vLLM on unified memory systems, and real-world throughput insights.

DGX Spark On-prem AI LLM inference PyTorch

ARTICLE·DEV.to AI·4/8/2026

Beyond the VM: Why vLLM and FlashAttention need Bare Metal GPUs 🚀

This technical piece explains why cloud VMs hurt LLM inference with frameworks such as vLLM and FlashAttention, citing problems such as batching jitter and virtualization bottlenecks. It argues that bare metal GPUs are crucial for optimal performance in production, preserving optimizations and NVLink bandwidth.

FlashAttention Virtualization GPU infrastructure

DOC·DeepLearning.AI (YouTube)·6d ago

Optimize, deploy, and benchmark an open-source LLM with vLLM

This resource covers how to optimize, deploy, and benchmark open-source large language models (LLMs) with the vLLM library, with practical guidance on improving deployment performance and efficiency.

Optimization deployment Benchmarking vLLM

DOC·DEV.to AI·26d ago

How to Deploy Llama 3.2 with vLLM + Batch Processing on a $8/Month DigitalOcean Droplet: Asynchronous Inference at 1/125th Claude Cost

This article provides a detailed guide on deploying Llama 3.2 with vLLM and batch processing on a low-cost DigitalOcean Droplet. It demonstrates how to achieve asynchronous inference at significantly lower costs compared to commercial AI APIs like Claude, processing over 10,000 tokens per second for $8/month.

learning Cost Optimization Llama 3.2 LLM deployment

DOC·DEV.to AI·26d ago

How to Deploy Nemotron-4 340B with vLLM on a $24/Month DigitalOcean GPU Droplet: Enterprise-Grade Reasoning at 1/130th Claude Opus Cost

This guide details how to deploy NVIDIA's Nemotron-4 340B model with vLLM on a DigitalOcean GPU Droplet for $24/month. This setup offers enterprise-grade reasoning capabilities, achieving a 99% cost reduction compared to using Claude Opus API for similar workloads.

NVIDIA Nemotron-4 learning AI deployment Cost Optimization

ARTICLE·Hugging Face Blog·5/6/2026

vLLM V0 to V1: Correctness Before Corrections in RL

This content discusses the transition from vLLM V0 to V1, focusing on the importance of correctness over corrections in Reinforcement Learning. It explores development principles and enhancements to ensure integrity and performance in AI systems.

LLMs reinforcement learning machine learning AI development

DOC·DEV.to AI·5/9/2026

How to Deploy Qwen2.5 72B with vLLM + FastAPI on a $20/Month DigitalOcean GPU Droplet: Production Inference at 1/90th Claude Cost

This article details how to deploy the Qwen2.5 72B model on a DigitalOcean GPU Droplet for just $20/month. It offers a low-cost alternative to commercial LLM APIs, promising production inference with performance competitive to Claude 3.5 Sonnet and a 98% cost reduction.

learning Qwen2.5 Cost Optimization LLM deployment

DOC·DEV.to AI·25d ago

How to Deploy Mistral Nemo with vLLM + Flash Attention on a $12/Month DigitalOcean GPU Droplet: 3x Faster Inference at 1/95th Claude Cost

This article details how to deploy the Mistral Nemo model on a $12/month DigitalOcean GPU Droplet, leveraging vLLM and Flash Attention. It claims 3x faster inference at roughly 1/95th the cost of commercial AI APIs like Claude, and advocates efficient self-hosting of open-source AI models.

Mistral Nemo Flash Attention AI deployment Cost Optimization

DOC·DEV.to AI·26d ago

How to Deploy Qwen2.5 32B with vLLM + Quantization on a $12/Month DigitalOcean GPU Droplet: Production-Grade Inference at 1/100th Claude Cost

This content details how to deploy the Qwen2.5 32B language model using vLLM and quantization on a $12/month DigitalOcean GPU droplet. It demonstrates production-grade inference at a significantly lower cost than commercial APIs.

deployment quantization Cost Optimization vLLM

DOC·AWS Machine Learning Blog·20d ago

Build real-time voice applications with Amazon SageMaker AI and vLLM

Real-time voice applications, such as voice agents and live captioning, rely on simultaneous speech-to-text transcription. Traditional request-response inference falls short, introducing latency that hinders real-time functionality.

voice applications Speech-to-Text real-time AI Amazon SageMaker

DOC·DEV.to AI·8d ago

How to Deploy Llama 3.2 Vision with vLLM + Quantization on a $6/Month DigitalOcean Droplet: Multimodal Reasoning at 1/210th GPT-4 Vision Cost

This content explains how to deploy Llama 3.2 Vision with vLLM and quantization on a DigitalOcean Droplet to drastically reduce costs compared to GPT-4 Vision. It highlights production-grade multimodal inference at a fraction of the price.

multimodal AI Llama 3 AI deployment Cost Optimization