Memory Optimization

12 items

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/20/2026

Open-source single-GPU reproductions of Cartridges and STILL for neural KV-cache compaction [P]

The author implemented and open-sourced reproductions of two recent ideas, Cartridges and STILL, for neural KV-cache compaction and long-context inference. The goal is to make these research ideas easy to inspect and run, with benchmark code that also compares them against existing methods.

neural networks Open Source research Memory Optimization

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/11/2026

What if your HNSW index stored 3-bit embeddings instead of float32? [R]

The post explores an experimental approach to HNSW vector indexing that uses 3-bit quantized embeddings instead of float32 to reduce memory usage. The technique, based on PolarQuant, allows efficient distance computation via precomputed tables, yielding memory savings and good recall, despite a slower build process and challenges with quantization noise.
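To make the idea of table-based distances concrete, here is a minimal sketch. It does not implement PolarQuant: it swaps in plain uniform 3-bit scalar quantization, and the names `quantize`, `build_table`, `table_distance` and the value range `lo`/`hi` are assumptions for illustration only. Going from 32 bits to 3 bits per coordinate is roughly a 10.7x reduction, although this sketch keeps codes as ordinary Python integers rather than packing them.

```python
LEVELS = 8  # 3 bits per coordinate gives 2**3 = 8 quantization levels


def quantize(vec, lo, hi):
    """Map each float in [lo, hi] to the nearest of 8 evenly spaced levels."""
    step = (hi - lo) / (LEVELS - 1)
    return [min(LEVELS - 1, max(0, round((x - lo) / step))) for x in vec]


def build_table(lo, hi):
    """Precompute squared differences between every pair of level centers."""
    step = (hi - lo) / (LEVELS - 1)
    centers = [lo + i * step for i in range(LEVELS)]
    return [[(a - b) ** 2 for b in centers] for a in centers]


def table_distance(codes_a, codes_b, table):
    """Approximate squared L2 distance using only table lookups."""
    return sum(table[a][b] for a, b in zip(codes_a, codes_b))


# Usage: quantize two vectors once, then compare them without touching floats.
table = build_table(-1.0, 1.0)
a = quantize([-1.0, 0.1, 1.0], -1.0, 1.0)
b = quantize([0.9, -0.8, 0.2], -1.0, 1.0)
approx = table_distance(a, b, table)
```

The 8 x 8 table is built once per index, so each distance evaluation during graph traversal reduces to one lookup per coordinate. The quantization noise mentioned in the summary shows up here as the gap between `approx` and the exact float distance.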

HNSW Memory Optimization quantization Vector Indexing

RESEARCH · arXiv CS.LG · 4/28/2026

Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

This work addresses the significant memory footprint of Key-Value (KV) caching in transformer language models and proposes optimizing it along the depth dimension. It introduces a method for cross-layer cache sharing, reports that a layer's cache can be dropped efficiently without information loss, and suggests a training approach based on random cross-layer attention.
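For a sense of scale, the sketch below applies the standard KV-cache size formula (two tensors, keys and values, per cached layer) and shows how the footprint shrinks when several layers read from one shared cache. The function name, the `layers_per_cache` parameter, and the idea of grouping a fixed number of layers per cache are assumptions for illustration; the paper's routing scheme may share caches differently.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch=1, layers_per_cache=1):
    """Estimate KV-cache size in bytes.

    layers_per_cache=1 means every layer keeps its own cache (no sharing).
    layers_per_cache=k means each group of k layers shares one cache
    (a hypothetical sharing pattern, used here only for the arithmetic).
    """
    # Ceiling division: number of distinct caches that must be stored.
    num_caches = -(-num_layers // layers_per_cache)
    # Factor 2 accounts for storing both keys and values.
    return (2 * num_caches * num_kv_heads * head_dim
            * seq_len * batch * bytes_per_value)


# Usage with made-up model dimensions and 16-bit values.
baseline = kv_cache_bytes(32, 8, 128, 4096)
shared = kv_cache_bytes(32, 8, 128, 4096, layers_per_cache=2)
```

With these illustrative dimensions the unshared cache is 512 MiB per sequence, and sharing each cache between two layers halves it.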

deep learning Memory Optimization large language models Transformers

RESEARCH · arXiv CS.LG · 29d ago

RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory

This paper introduces RateQuant, a method for optimal mixed-precision KV cache quantization in large language models to address memory bottlenecks. It tackles the challenge of distortion model mismatch, where applying one quantizer's distortion model to another degrades performance compared to uniform quantization.
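As background on what mixed-precision allocation means, the sketch below distributes a bit budget across channels using the textbook high-resolution distortion model, where a channel with variance `v` quantized at `b` bits has distortion proportional to `v * 4**(-b)`. This is not RateQuant's algorithm, and the function name, the per-channel variances, and the bit limits are all assumptions for illustration. The summary's point about model mismatch is precisely that a formula like this one can be wrong for a given quantizer.

```python
def allocate_bits(variances, total_bits, min_bits=1, max_bits=8):
    """Greedy bit allocation under the assumed model D_i = var_i * 4**(-b_i).

    Each extra bit cuts a channel's modeled distortion by 4x, so the largest
    gain always comes from the channel with the highest current distortion.
    """
    bits = [min_bits] * len(variances)
    spare = total_bits - sum(bits)
    if spare < 0:
        raise ValueError("total_bits is too small for min_bits per channel")
    while spare > 0:
        candidates = [i for i in range(len(bits)) if bits[i] < max_bits]
        if not candidates:
            break  # every channel is already at max_bits
        worst = max(candidates, key=lambda j: variances[j] * 4.0 ** (-bits[j]))
        bits[worst] += 1
        spare -= 1
    return bits


# Usage: a high-variance channel receives more bits than a low-variance one.
plan = allocate_bits([16.0, 1.0], total_bits=4)
```

Uniform quantization would give both channels 2 bits; under the assumed model the greedy plan instead moves a bit toward the high-variance channel.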

Memory Optimization quantization AI Research LLM

RESEARCH · arXiv CS.CL · 4/8/2026

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

MegaTrain is a memory-centric system that enables efficient full-precision training of large language models with more than 100 billion parameters on a single GPU. It stores parameters in host memory and uses optimizations such as a pipelined execution engine and stateless layer templates to overcome bandwidth bottlenecks and maximize GPU utilization.

Single GPU Training Memory Optimization GPU Acceleration large language models

RESEARCH · arXiv CS.CL · 4/15/2026

LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

LoSA introduces Locality Aware Sparse Attention to address memory-bound attention and the KV Inflation problem in block-wise diffusion language models, especially for long contexts. It optimizes performance by reusing cached attention for stable tokens and applying sparse attention only to active tokens, significantly reducing KV index loading.

Memory Optimization Long Context KV Inflation sparse attention

ARTICLE · DEV.to AI · 4/25/2026

DeepSeek V4's Real Innovation Isn't Scale—It's Memory Architecture

DeepSeek V4's true innovation lies in its memory architecture, not just its scale, making its 1M token context practically usable. Through KV cache compression techniques like CSA and HCA, it achieves a nearly 9x memory reduction compared to its predecessor, overcoming practical challenges of long-context models.

AI models LLMs deep learning Memory Optimization

ARTICLE · DEV.to AI · 5/1/2026

2 Lines of Code Saved 6.4x Memory on My Snake AI

The author describes how a 'direction' channel in their Snake AI's state representation caused a 6.4x memory overhead. Using uint8 for only 2 bits of information prevented efficient bit-packing, leading to 1,600 bytes per state instead of 250.
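The arithmetic behind those figures can be reproduced with a small sketch. The layout below is an assumption chosen because it matches the quoted numbers, not the author's confirmed design: a 20 x 20 board, three binary planes, and a direction value in the range 0..3. Storing the direction as a byte-valued plane forces one byte per cell (4 x 400 = 1,600 bytes), while splitting it into its two bits makes every plane binary, so all 2,000 cells pack into 250 bytes.

```python
GRID = 20            # assumed board size
CELLS = GRID * GRID  # 400 cells per plane


def unpacked_state(binary_planes, direction):
    """One byte per cell: three 0/1 planes plus a plane filled with 0..3."""
    flat = [v for plane in binary_planes for v in plane]
    flat += [direction] * CELLS
    return bytes(flat)


def packed_state(binary_planes, direction):
    """Split direction into 2 bit planes, then pack 8 cells into each byte."""
    flat = [v for plane in binary_planes for v in plane]
    for bit in range(2):
        flat += [(direction >> bit) & 1] * CELLS
    out = bytearray()
    # 5 planes * 400 cells = 2000 bits, an exact multiple of 8.
    for i in range(0, len(flat), 8):
        byte = 0
        for v in flat[i:i + 8]:
            byte = (byte << 1) | v
        out.append(byte)
    return bytes(out)


def unpack_bits(packed):
    """Inverse of the packing step: recover the flat list of 0/1 cells."""
    return [(byte >> (7 - k)) & 1 for byte in packed for k in range(8)]


# Usage: three hypothetical planes (for example head, body, food), moving in direction 2.
planes = [[0] * CELLS for _ in range(3)]
planes[0][5] = 1
before = unpacked_state(planes, 2)
after = packed_state(planes, 2)
```

The ratio `len(before) / len(after)` is 1600 / 250 = 6.4, matching the overhead in the title.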

Memory Optimization Data Representation AI

RESEARCH · arXiv CS.LG · 4/28/2026

Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation

This research challenges the assumption that Parameter-Efficient Fine-Tuning (PEFT) equates to memory efficiency for on-device LLMs, showing existing methods can still lead to out-of-memory errors. It introduces LARS (Low-memory Activation-Rank Subspace), a novel framework that decouples memory consumption from sequence length by constraining the activation subspace, achieving an average 33.54% memory footprint reduction.

Memory Optimization on-device AI Fine-tuning PEFT

RESEARCH · arXiv CS.LG · 4/21/2026

BASIS: Balanced Activation Sketching with Invariant Scalars for "Ghost Backpropagation"

This paper introduces BASIS, an efficient backpropagation algorithm designed to mitigate the O(L * BN) spatial memory bottleneck in deep neural networks. It fully decouples activation memory from batch and sequence dimensions, preserving exact error signals while computing weight updates with massively compressed tensors, and addresses gradient instability with novel mechanisms.

neural networks deep learning Memory Optimization backpropagation

RESEARCH · arXiv CS.LG · 29d ago

LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction

This paper introduces LKV (Learned KV Eviction), a novel approach to optimize Key-Value (KV) cache memory in Large Language Models (LLMs). LKV formulates KV compression as an end-to-end differentiable optimization problem, learning budgets and token selection to overcome limitations of heuristic methods.
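For contrast, the sketch below shows the kind of heuristic eviction rule that learned approaches aim to replace: keep the most recent tokens plus the older tokens with the highest importance score, up to a fixed budget. It is not LKV. The function name, the `keep_recent` window, and the use of one precomputed score per token (for example, accumulated attention weight) are assumptions for illustration.

```python
def evict_by_score(scores, budget, keep_recent=0):
    """Return sorted indices of cached tokens to keep under a fixed budget.

    scores: one importance score per cached token (assumed precomputed).
    budget: maximum number of tokens to retain for this head.
    keep_recent: number of most recent tokens that are always retained.
    """
    n = len(scores)
    keep_recent = min(keep_recent, budget, n)
    recent = set(range(n - keep_recent, n))
    remaining = budget - len(recent)
    older = [i for i in range(n) if i not in recent]
    # Rank older tokens by score and keep only as many as the budget allows.
    top = sorted(older, key=lambda i: scores[i], reverse=True)[:remaining]
    return sorted(recent | set(top))


# Usage: keep 3 of 5 cached tokens, always retaining the newest one.
kept = evict_by_score([0.9, 0.1, 0.5, 0.2, 0.3], budget=3, keep_recent=1)
```

Here both the budget and the scoring rule are fixed by hand and applied identically to every head, whereas the summary describes learning per-head budgets and token selection end to end.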

deep learning Memory Optimization efficiency KV cache

RESEARCH · arXiv CS.LG · 4/30/2026

Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective

This work rethinks KV cache eviction for LLMs using an information-theoretic objective derived from the Information Bottleneck principle. It introduces CapKV, a new capacity-aware method that preserves information, outperforming existing heuristic strategies.

Memory Optimization machine learning large language models AI inference