vLLM

14 items

DOC · ↑ trending · Reddit r/LocalLLaMA · 4/11/2026

Run Qwen3.5-397B-A17B with vLLM and 8xR9700

This document describes an optimized setup for running the Qwen3.5-397B-A17B-MXFP4 model with vLLM on RDNA4 GPUs, using 8x R9700 as the example configuration. It provides a Dockerfile with Triton patches, plus instructions for downloading the model and launching the inference container.

42
DOC · DEV.to AI · 26d ago

How to Deploy Llama 3.2 with vLLM + Batch Processing on a $8/Month DigitalOcean Droplet: Asynchronous Inference at 1/125th Claude Cost

This article is a detailed guide to deploying Llama 3.2 with vLLM and batch processing on a low-cost DigitalOcean Droplet. It claims asynchronous inference at about 1/125th the cost of commercial AI APIs like Claude, processing over 10,000 tokens per second for $8/month.

27
DOC · DEV.to AI · 25d ago

How to Deploy Mistral Nemo with vLLM + Flash Attention on a $12/Month DigitalOcean GPU Droplet: 3x Faster Inference at 1/95th Claude Cost

This article details how to deploy the Mistral Nemo model on a $12/month DigitalOcean GPU Droplet using vLLM and Flash Attention. It claims 3x faster inference at about 1/95th the cost of commercial AI APIs like Claude, and argues for efficient self-hosting of open-source AI models.

27