VRAM

10 items

DOC·↑ trending·Reddit r/LocalLLaMA·19d ago

Latest b9274 Addresses MTP VRAM leak

The b9274 update fixes a VRAM leak affecting MTP (Multi-Token Prediction) models, where GPU-allocated resources were not freed across sleep/resume cycles. The fix explicitly resets the speculative decoder, draft context, and draft model resources in the destroy() function to prevent out-of-memory errors.

server MTP VRAM memory leak

ARTICLE·↑ trending·Reddit r/MachineLearning·4/12/2026

KIV: 1M token context window on a RTX 4070 (12GB VRAM), no retraining, drop-in HuggingFace cache replacement - Works with any model that uses DynamicCache [P]

KIV (K-Indexed V Materialization) is a middleware layer that replaces the standard HuggingFace KV cache with a tiered retrieval system, moving old data to system RAM. This enables 1M token context windows on an RTX 4070 (12GB VRAM) with only 12MB VRAM overhead and good performance.

KIV LLM optimization Context window VRAM
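
The tiered idea in the summary above (keep recent entries close, move older ones to system RAM, fetch them back on demand) can be illustrated with a toy sketch. This is not KIV's code or its retrieval algorithm: the class name `TieredKVCache`, the `hot_capacity` and `top_k` parameters, and the dot-product scoring are all assumptions made for illustration, and plain Python lists stand in for GPU and CPU tensors.

```python
from collections import deque


class TieredKVCache:
    """Toy two-tier cache: recent entries stay 'hot', older ones spill to 'cold'.

    Hypothetical illustration only; it does not use or mirror the
    HuggingFace DynamicCache API.
    """

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = deque()  # stands in for GPU-resident (key, value) pairs
        self.cold = []      # stands in for entries moved to system RAM

    def append(self, key, value):
        self.hot.append((key, value))
        # Spill the oldest entries once the hot tier exceeds its capacity.
        while len(self.hot) > self.hot_capacity:
            self.cold.append(self.hot.popleft())

    def retrieve(self, query, top_k):
        """Return the top_k cold entries by dot-product score, then all hot entries."""
        def score(entry):
            key, _ = entry
            return sum(q * k for q, k in zip(query, key))

        fetched = sorted(self.cold, key=score, reverse=True)[:top_k]
        return fetched + list(self.hot)
```

In this sketch the hot tier has a fixed size, so the memory it stands in for stays bounded no matter how long the sequence grows; only the `top_k` retrieved entries are brought back for a given query.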

ARTICLE·↑ trending·Reddit r/LocalLLaMA·4/12/2026

MiniMax-M2.7 vs Qwen3.5-122B-A10B for 96GB VRAM full offload?!

The author compares MiniMax-M2.7 and Qwen3.5-122B-A10B GGUF models for local full offload on a 96GB VRAM rig. For their purposes, Qwen3.5-122B is preferred, despite MiniMax being more quantized, highlighting the trade-offs in performance for local LLM inference.

VRAM GGUF MiniMax Qwen

RESEARCH·↑ trending·Reddit r/LocalLLaMA·5/6/2026

Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...)

The post compares the quality of several Qwen 3.6 27B quantizations using a custom chess game test, aiming to find the best one for 16 GB VRAM setups. It evaluates each model's ability to track board state and generate accurate SVG images of the chessboard.

VRAM Benchmarking quantization model quality

ARTICLE·↑ trending·Reddit r/LocalLLaMA·4/9/2026

16 GB VRAM users, what model do we like best now?

A user with 16 GB of VRAM shares a positive experience running Qwen 3.5 27b at IQ3 quants on an RTX 4080, getting good speed and context length. They discuss the challenges of optimizing local AI models with this amount of VRAM, weighing quality against speed across different quantization levels.

LLMs VRAM language models hardware

ARTICLE·DEV.to AI·4/23/2026

I Built a Local AI VRAM Calculator & GPU Planner (Beta)

The author has launched a new beta tool called "Local AI VRAM Calculator & GPU Planner" to help determine GPU and VRAM requirements for running local LLMs. This tool aims to make hardware tradeoffs visible for different workloads and quantization levels before committing to components.

LLMs GPU VRAM AI tools
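
The kind of arithmetic such a planner makes visible can be sketched in a few lines. This is not the tool's code: the function name `estimate_vram_gib` and all of its parameters are assumptions for illustration, and the estimate covers only weights and KV cache, ignoring activations, compute buffers, and runtime overhead.

```python
def estimate_vram_gib(params_billion, bits_per_weight, n_layers, n_kv_heads,
                      head_dim, context_tokens, kv_bytes_per_value=2):
    """Rough (weights_gib, kv_cache_gib) estimate for a decoder-only model.

    Hypothetical helper: a lower bound, not a full memory plan.
    """
    # Weights: one value per parameter at the given average bit width.
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    # KV cache: K and V each hold n_layers * n_kv_heads * head_dim values per token.
    kv_bytes = (2 * n_layers * n_kv_heads * head_dim
                * context_tokens * kv_bytes_per_value)
    gib = 1024 ** 3
    return weight_bytes / gib, kv_bytes / gib


# Architecture numbers below are made up for illustration,
# not taken from any real model's config.
weights_gib, kv_gib = estimate_vram_gib(
    params_billion=27, bits_per_weight=4.5,
    n_layers=64, n_kv_heads=8, head_dim=128, context_tokens=8192,
)
print(f"weights ~{weights_gib:.1f} GiB, KV cache ~{kv_gib:.1f} GiB")
```

With these made-up numbers, the weights alone come to roughly 14.1 GiB and the 16-bit KV cache to 2 GiB, which shows why a 27B model at around 4.5 bits per weight is a tight fit on a 16 GB card once context is added.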

NEWS·↑ trending·Reddit r/LocalLLaMA·5/4/2026

Ryzen AI Max+ 495 (Gorgon Halo) with 192GB VRAM!

Leaks indicate that the AMD Ryzen AI Max+ PRO 495 (Gorgon Halo) might feature an APU with 192GB of VRAM, signaling a promising future for Local AI. Despite potential high costs due to the storage crisis, future versions like the Medusa Halo in 2027 are speculated to reach 256GB.

Ryzen AI VRAM AMD Local AI

ARTICLE·DEV.to AI·4/8/2026

99.8% of LLM Inference Power Isn't Spent on Computation

The article argues that power consumption, more than bandwidth or VRAM, is the biggest bottleneck in LLM inference because of physical limits. It attributes this to the breakdown of Dennard scaling around 2006, after which shrinking transistors no longer automatically reduced power consumption.

power consumption Bandwidth AI hardware VRAM

RESEARCH·arXiv CS.AI·6d ago

AURA: Action-Gated Memory for Robot Policies at Constant VRAM

This paper introduces AURA-Mem, a novel recurrent memory architecture for robots designed to maintain constant VRAM size. Unlike the KV-cache, AURA-Mem employs a learned gate that writes only when the current observation changes the next action, making it suitable for bandwidth-limited edge hardware in embodied agents.

robot policies VRAM embodied AI robotics
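
The gating idea described above (write to memory only when the observation would change the next action) can be shown with a toy sketch. This is not the paper's architecture: the class `ActionGatedMemory`, the `policy` callable, the `blend` update rule, and the explicit action comparison that stands in for the learned gate are all assumptions made for illustration.

```python
class ActionGatedMemory:
    """Fixed-size memory that is updated only when the write would change the action.

    Hypothetical illustration: a hand-written comparison replaces the learned gate.
    """

    def __init__(self, size, policy, blend=0.5):
        self.state = [0.0] * size  # never grows, unlike a KV cache
        self.policy = policy       # hypothetical callable: state -> action
        self.blend = blend
        self.writes = 0

    def step(self, observation):
        # Candidate state if this observation were written into memory.
        candidate = [(1 - self.blend) * s + self.blend * o
                     for s, o in zip(self.state, observation)]
        # Gate: commit the write only if it changes the chosen action.
        if self.policy(candidate) != self.policy(self.state):
            self.state = candidate
            self.writes += 1
        return self.policy(self.state)
```

Because the state list has a fixed length and most observations are skipped, both the memory footprint and the number of writes stay small in this sketch, which is the property the summary ties to bandwidth-limited edge hardware.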

ARTICLE·DEV.to AI·4/10/2026

i generated AI video on a GTX 1660. here's what it actually takes.

The article details FramePack F1, a tool that generates video from a single image using only 6 GB of VRAM, making it accessible on common GPUs such as the GTX 1660. It describes a pipeline architecture with five components and emphasizes the solution's practicality and local usability for real projects.

GPU VRAM AI FramePack