compression

9 items

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

Cloudflare open-sources lossless LLM compression tool

Cloudflare released Unweight, a lossless compression system that reduces LLM size by 15-22% without sacrificing output accuracy. The tool, which saves roughly 3 GB of VRAM on Nvidia H100 GPUs for Llama-3.1-8B, has been open-sourced on GitHub with plans to extend compression.

Open Source Optimization GPU compression
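
As a rough sanity check on the figures in the summary above, the arithmetic can be sketched as follows. The parameter count (about 8 billion) and 16-bit weight storage are assumptions for illustration, not details from the linked post, and the sketch applies the 15-22% reduction to the whole weight footprint.

```python
# Back-of-the-envelope check of the size reduction quoted in the summary.
# Assumptions (not from the source): ~8.0e9 parameters stored as 16-bit
# (2-byte) weights, with the 15-22% reduction applied to all weights.

PARAMS = 8.0e9          # assumed parameter count for Llama-3.1-8B
BYTES_PER_PARAM = 2     # assumed BF16/FP16 storage
GB = 1e9                # decimal gigabytes


def vram_saved_gb(reduction: float) -> float:
    """Return the GB saved when the weight footprint shrinks by `reduction`."""
    return PARAMS * BYTES_PER_PARAM * reduction / GB


low = vram_saved_gb(0.15)
high = vram_saved_gb(0.22)
print(f"estimated savings: {low:.1f} to {high:.1f} GB")
```

Under these assumptions the estimate brackets the "roughly 3 GB" quoted in the summary.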

RESEARCH · ↑ trending · Reddit r/MachineLearning · 5/3/2026

torch-nvenc-compress: GPU NVENC silicon as a PCIe bandwidth multiplier — PCA + pure-ctypes Video Codec SDK wrapper. Parallel-path overlap measured at 67% of theoretical max on a real GEMM + encode workload. [P]

This project introduces torch-nvenc-compress, a Python library that uses the GPU's NVENC/NVDEC hardware to compress LLM activations and KV cache, aiming to relieve PCIe bandwidth bottlenecks in multi-GPU setups built from consumer GPUs. The author reports parallel-path overlap at 67% of the theoretical maximum on a real GEMM + encode workload.

NVENC GPU PCIe compression

ARTICLE · 3Blue1Brown (YouTube) · 2d ago

Reinventing Entropy | Compression & Intelligence Part 1

This video, the first part of a series, explores the relationship between entropy, compression, and intelligence. It aims to reframe how these fundamental concepts are understood.

information theory intelligence AI compression

RESEARCH · arXiv CS.CL · 4d ago

Generic Triple-Latent Compression with Gated Associative Retrieval

This research introduces generic triple-latent sequence models, which use a running token state and a compressed pair-memory to capture higher-order token interactions. These models improve on a Transformer baseline on language-model benchmarks; a retrieval extension improves recall further but runs slower.

language models latent models sequence models associative retrieval

RESEARCH · DEV.to AI · 4/26/2026

FIDT as a Domain-Specific Generator: A Honest Reframing of Fujimoto Infinite Dot Theory (Paper 140)

This article reframes the Fujimoto Infinite Dot Theory (FIDT) from a universal codec to a domain-specific generator for D-FUMT₈ theories. According to the author, this new positioning, developed in collaboration with Claude Opus 4.7, achieves byte-exact reconstruction and high compression.

information theory research large language models compression

RESEARCH · arXiv CS.LG · 20d ago

Robust Basis Spline Decoupling for the Compression of Transformer Models

This work introduces a B-spline-based decoupling framework for compressing Transformer models. It generalizes existing tensor-based methods and uses the properties of B-splines to address their numerical instability and limited expressiveness.

neural networks machine learning AI compression

RESEARCH · arXiv CS.LG · 4/6/2026

Haiku to Opus in Just 10 bits: LLMs Unlock Massive Compression Gains

This study explores the compression of LLM-generated text in lossless and lossy regimes, presenting methods that improve efficiency by 2x, such as LoRA adapters and concise rewrites. It also introduces interactive Question-and-Answer (QA) compression, a protocol that transfers one bit per answer to recover a significant portion of the capability of larger models.

lossy compression LLMs arithmetic coding compute frontier

NEWS · ML Mastery · 4/30/2026

Effective KV Compression with TurboQuant

Google recently launched TurboQuant, an algorithmic suite and library for applying advanced quantization and compression to large language models (LLMs) and to vector search engines, a key component of RAG systems.

LLMs quantization vector search RAG systems

ARTICLE · DEV.to AI · 24d ago

High-Quality, Low-Delay Music Coding in the Opus Codec

This article discusses the Opus codec and how it delivers high-quality music coding with low delay. It focuses on the technical aspects that make its audio compression efficient.

low-latency audio coding compression digital audio