
Memory Optimization

12 items

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/11/2026

What if your HNSW index stored 3-bit embeddings instead of float32? [R]

The post explores an experimental approach to HNSW vector indexing that stores 3-bit quantized embeddings instead of float32 to reduce memory use. The technique, based on PolarQuant, allows efficient distance computation via precomputed tables, yielding memory savings and good recall, at the cost of a slower build process and challenges with quantization noise.

42
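As a rough illustration of the table-based distance idea in the 3-bit HNSW item above, here is a minimal sketch. It assumes a plain uniform scalar quantizer, not PolarQuant itself, and every name in it (`quantize`, `build_lut`, `lut_distance`, `bytes_per_vector`) is hypothetical. For simplicity the codes are held one per byte, and `bytes_per_vector` reports what a bit-packed layout would need.

```python
import numpy as np

BITS = 3
LEVELS = 1 << BITS  # 8 possible codes per dimension

def quantize(x, lo, hi):
    """Map each float to one of 8 uniform levels in [lo, hi] (assumed scheme)."""
    step = (hi - lo) / (LEVELS - 1)
    return np.clip(np.rint((x - lo) / step), 0, LEVELS - 1).astype(np.uint8)

def build_lut(query, lo, hi):
    """lut[d, c] = squared difference between query[d] and the value of code c."""
    centers = np.linspace(lo, hi, LEVELS)             # shape (8,)
    return (query[:, None] - centers[None, :]) ** 2   # shape (dim, 8)

def lut_distance(lut, codes):
    """Approximate squared L2 distance: one table lookup per dimension, then sum."""
    return float(lut[np.arange(codes.shape[0]), codes].sum())

def bytes_per_vector(dim, bits):
    """Storage for one vector if codes were bit-packed (rounded up to whole bytes)."""
    return (dim * bits + 7) // 8

# Toy usage: the table is built once per query and reused for every candidate.
query = np.array([0.0, 1.0])
codes = quantize(np.array([2.0, 5.0]), lo=0.0, hi=7.0)
dist = lut_distance(build_lut(query, 0.0, 7.0), codes)
```

For a 768-dimensional vector, `bytes_per_vector(768, 32)` gives 3072 bytes and `bytes_per_vector(768, 3)` gives 288 bytes, roughly a 10.7x reduction before any per-vector metadata.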
RESEARCH · arXiv CS.LG · 4/28/2026

Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

This work addresses the significant memory footprint of Key-Value (KV) caching in transformer language models, proposing optimization through the depth dimension. It introduces a method for cross-layer cache sharing, demonstrating that dropping a layer's cache can be efficient without information loss, and suggests a training approach with random cross-layer attention.

29
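To make the KV-cache footprint in the item above concrete, the following back-of-the-envelope sketch uses the standard size formula (keys plus values, per layer, per head, per token). The model dimensions are hypothetical, and treating "half the layers reuse another layer's cache" as a halved layer count is a simplifying assumption, not the paper's method.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Bytes for keys and values across all layers that keep their own cache."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical 32-layer model, 8 KV heads of size 128, 8192-token context, fp16.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

# Assumption: 16 of the 32 layers attend to another layer's cache instead of storing one.
shared = kv_cache_bytes(num_layers=16, num_kv_heads=8, head_dim=128, seq_len=8192)
```

Under these assumptions `full` is 1 GiB per sequence and `shared` is 512 MiB, so the saving scales linearly with the number of layers that give up their own cache.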
RESEARCH · arXiv CS.CL · 4/8/2026

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

MegaTrain is a memory-centric system that enables efficient full-precision training of large language models with more than 100 billion parameters on a single GPU. It stores parameters in host memory and uses optimizations such as a pipelined execution engine and stateless layer templates to overcome bandwidth bottlenecks and maximize GPU utilization.

29
RESEARCH · arXiv CS.CL · 4/15/2026

LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

LoSA introduces Locality Aware Sparse Attention to address memory-bound attention and the KV Inflation problem in block-wise diffusion language models, especially for long contexts. It optimizes performance by reusing cached attention for stable tokens and applying sparse attention only to active tokens, significantly reducing KV index loading.

27
RESEARCH · arXiv CS.LG · 4/28/2026

Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation

This research challenges the assumption that Parameter-Efficient Fine-Tuning (PEFT) equates to memory efficiency for on-device LLMs, showing existing methods can still lead to out-of-memory errors. It introduces LARS (Low-memory Activation-Rank Subspace), a novel framework that decouples memory consumption from sequence length by constraining the activation subspace, achieving an average 33.54% memory footprint reduction.

27
RESEARCH · arXiv CS.LG · 4/21/2026

BASIS: Balanced Activation Sketching with Invariant Scalars for "Ghost Backpropagation"

This paper introduces BASIS, an efficient backpropagation algorithm designed to mitigate the O(L * BN) spatial memory bottleneck in deep neural networks. It fully decouples activation memory from batch and sequence dimensions, preserving exact error signals while computing weight updates with massively compressed tensors, and addresses gradient instability with novel mechanisms.

27