GPU

46 items

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

Cloudflare open-sources lossless LLM compression tool

Cloudflare released Unweight, a lossless compression system that reduces LLM size by 15-22% without sacrificing output accuracy. The tool, which saves roughly 3 GB of VRAM on Nvidia H100 GPUs for Llama-3.1-8B, has been open-sourced on GitHub with plans to extend compression.

Open Source Optimization GPU compression

NEWS · ↑ trending · Reddit r/MachineLearning · 4/22/2026

GPU Compass – open-source, real-time GPU pricing across 20+ clouds [P]

GPU Compass, an open-source tool, has been launched to provide real-time GPU pricing across more than 20 cloud providers. It catalogs 50 GPU models and over 2,000 offerings, including on-demand, spot pricing, and historical trends, making the raw data accessible to everyone.

Open Source cloud computing GPU AI infrastructure

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/11/2026

Gemma 4 31B vs Qwen 3.5 27B: Which is best for long context workflows? My THOUGHTS...

The post compares Gemma 4 31B and Qwen 3.5 27B, which the author considers the best models for local use on 24GB GPUs. The author favors Qwen 3.5 27B for its stronger reasoning and for long-context analysis that, in their testing, did not hallucinate.

GPU Gemma 4 31B Long Context Qwen 3.5 27B

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/13/2026

MiniMax-M2.7 NVFP4 on 2x RTX PRO 6000 Blackwell — bench numbers

Benchmark results for the MiniMax-M2.7 LLM with NVFP4 quantization on two NVIDIA RTX PRO 6000 Blackwell GPUs. The post reports decode throughput at several concurrency levels and prefill performance across different context sizes.

GPU Benchmarking NVIDIA Blackwell MiniMax M2.7

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 5/6/2026

Analysis of the 100 most popular hardware setups on Hugging Face

An analysis of the 100 most popular hardware setups used on the Hugging Face platform, with insights into infrastructure preferences and trends in AI development.

Hugging Face cloud computing GPU AI hardware

DOC · ↑ trending · Reddit r/LocalLLaMA · 4/11/2026

Run Qwen3.5-397B-A13B with vLLM and 8xR9700

This document details the optimized execution of the Qwen3.5-397B-A17B-MXFP4 model using vLLM on RDNA4 GPUs, such as 8xR9700. It provides a Dockerfile with Triton patches and instructions for downloading the model and launching the inference container.

Docker GPU MXFP4 Qwen

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 5/1/2026

nvidia/Gemma-4-26B-A4B-NVFP4

The post confirms that the Gemma-4-26B-A4B-NVFP4 model runs on an NVIDIA 5090 GPU, reporting 18.8GB of VRAM usage and a 50k context. It also compares benchmark scores for the NVFP4 version against full precision on metrics such as GPQA, AIME, and MMLU Pro.

AI models GPU Benchmarking NVIDIA

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/9/2026

Anyone have an S3-compatible store that actually saturates H100s without the AWS egress tax? [R]

A user is training AI models on Lambda Labs with a 40TB dataset stored in AWS S3 and is facing high egress fees. They are looking for a fast storage alternative without egress fees, or an NVMe caching layer, after latency problems with Cloudflare R2 led to GPU underutilization.

cloud storage GPU AI training HPC

CASE · ↑ trending · Reddit r/LocalLLaMA · 4/23/2026

Qwen 3.6 is actually useful for vibe-coding, and way cheaper than Claude

The author successfully implemented Qwen 3.6 models (27B and 35B) locally for coding, demonstrating comparable performance to Claude Code. This local setup drastically reduced costs, from an estimated $142 in API calls to less than $4 in electricity over 8 hours.

GPU Claude local inference Cost Savings

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/23/2026

An Overnight Stack for Qwen3.6–27B: 85 TPS, 125K Context, Vision — on One RTX 3090 | by Wasif Basharat | Apr, 2026

According to the title, the stack runs the Qwen3.6–27B model at 85 TPS with a 125K context and vision capabilities on a single RTX 3090, a notable result for efficient LLM deployment on consumer hardware.

Optimization multimodal AI GPU large language models

RESEARCH · ↑ trending · Reddit r/MachineLearning · 5/3/2026

torch-nvenc-compress: GPU NVENC silicon as a PCIe bandwidth multiplier — PCA + pure-ctypes Video Codec SDK wrapper. Parallel-path overlap measured at 67% of theoretical max on a real GEMM + encode workload. [P]

This project introduces the Python library "torch-nvenc-compress," which leverages the GPU's NVENC/NVDEC hardware to compress LLM activations and KV cache, aiming to overcome PCIe bandwidth bottlenecks in multi-GPU setups. It measures a parallel-path overlap at 67% of theoretical max, improving communication between consumer GPUs.

NVENC GPU PCIe compression

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 5/7/2026

Need advice on hardware purchasing decision: RTX 5090 vs. M5 Max 128GB for agentic software development

The user is seeking advice on choosing between an RTX 5090 and an M5 Max 128GB for agentic software development using Qwen3.6 27B locally. The RTX 5090 offers 3x speed, while the M5 Max provides 4x memory, presenting a trade-off between rapid code generation and larger context capacity.

LLMs GPU hardware performance

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/24/2026

Hard freakin' decision..Blackwell 96G or Mac Studio 256G

A user sought advice on purchasing high-end AI hardware to run large models like Gemma4s and Qwen3.6s, weighing options between a Blackwell/RTX Pro 6000 96G GPU and a Mac Studio M3 Ultra 256G. They ultimately decided on the Blackwell option, citing its superior token handling capabilities and a favorable deal.

AI applications GPU AI hardware large language models

ARTICLE · DEV.to AI · 4/23/2026

I Built a Local AI VRAM Calculator & GPU Planner (Beta)

The author has launched a new beta tool called "Local AI VRAM Calculator & GPU Planner" to help determine GPU and VRAM requirements for running local LLMs. This tool aims to make hardware tradeoffs visible for different workloads and quantization levels before committing to components.

LLMs GPU VRAM AI tools

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/12/2026

Weekend project with Intel B70s

A user is building a high-end system with Intel Arc B70 GPUs and a Gigabyte B850 AI Top motherboard. The goal is to test the Gemma 4 model in legal RAG applications, utilizing a Hermes agent.

Legal AI GPU RAG AI Model

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 26d ago

The RTX 5000 PRO (48GB) arrived and it is better than I expected.

The author, a novice PC builder, bought an RTX 5000 Pro GPU for local LLM processing, spending $5600 in total. Despite initial struggles with assembly and software setup (Linux, vLLM), they found the GPU's performance better than expected.

local LLM PC Build GPU AI

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/17/2026

Which computer should I buy: Mac or custom-built 5090? [D]

The user seeks advice on choosing between a Mac M5 MAX with MLX and a custom-built PC with an RTX 5090 for their machine learning projects. Their work primarily involves fine-tuning large pre-trained models and training from scratch, often with image/video data and sometimes LLMs, making VRAM a critical factor.

deep learning GPU machine learning hardware

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/9/2026

backend-agnostic tensor parallelism has been merged into llama.cpp

Backend-agnostic tensor parallelism has been merged into llama.cpp, allowing AI models to run much faster on multi-GPU systems. This means the performance speedup no longer requires CUDA.

LLMs Optimization GPU AI

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 25d ago

China modded GPU (eg. 4090 48gb) --> I'm gonna figure it out. IS THERE NO ONE ELSE CURIOUS??

The author expresses a strong interest in understanding Chinese modded GPUs, like a 4090 48GB, highlighting a lack of information in the English-speaking world. They are looking for user experiences regarding their performance, reliability, software quirks, benchmarks, and pricing, especially for AI/LLM applications.

modding China tech GPU AI hardware

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/27/2026

Anyone using Tensordock GPU instances and having problems with failing VM’s [D]

A user reports critical issues with Tensordock GPU instances, where their VM for valuable research has failed to start for two days despite continuous payments. They express extreme frustration over the complete lack of support and the service's unreliability, fearing data loss with unclear compensation.

cloud computing GPU AI infrastructure service-issues