Model Compression

8 items

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/17/2026

Ternary Bonsai: Top intelligence at 1.58 bits

Prism ML announced Ternary Bonsai, a new family of 1.58-bit language models designed to balance strict memory constraints with high accuracy. These models, available in 8B, 4B, and 1.7B sizes, achieve a 9x smaller memory footprint than 16-bit models while outperforming most peers.

Model Compression · language models · Efficient AI
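
The "1.58 bits" figure is the information content of a three-valued weight, log2(3). The sketch below works through the weight-memory arithmetic for the three listed sizes. It is a back-of-the-envelope estimate only: the helper `weight_memory_gib` is a hypothetical name, and the calculation assumes ideal packing and ignores activations, KV cache, scale factors, and any layers kept at higher precision, which may be why the ideal ratio comes out above the announced 9x.

```python
import math

def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GiB (hypothetical helper)."""
    return n_params * bits_per_weight / 8 / 2**30

# Information content of one ternary weight in {-1, 0, +1}.
TERNARY_BITS = math.log2(3)

for n_params in (8e9, 4e9, 1.7e9):
    fp16 = weight_memory_gib(n_params, 16)
    ternary = weight_memory_gib(n_params, TERNARY_BITS)
    print(
        f"{n_params / 1e9:.1f}B params: "
        f"16-bit {fp16:.2f} GiB, ternary {ternary:.2f} GiB, "
        f"ideal ratio {fp16 / ternary:.2f}x"
    )
```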

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/14/2026

How to Distill from 100B+ to <4B Models

This article covers model distillation, focusing on how to compress massive models with more than 100 billion parameters into much smaller ones with fewer than 4 billion. The goal is to make the capabilities of large models cheaper to run and more widely accessible.

Model Compression · LLMs · Model Distillation · AI Efficiency
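
As a rough illustration of what distillation optimizes, here is a minimal sketch of the classic temperature-scaled soft-label loss (KL divergence between teacher and student output distributions). This is the textbook formulation, not necessarily the recipe the linked article uses; the function names and the default temperature are assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature**2 factor is the usual rescaling that keeps gradient
    magnitudes comparable across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        t * (math.log(t) - math.log(max(s, 1e-300)))
        for t, s in zip(p_teacher, p_student)
        if t > 0
    )
    return kl * temperature ** 2

# Toy logits for one token position over a 3-word vocabulary.
teacher = [1.0, 2.0, 3.0]
print(distillation_loss(teacher, teacher))          # matching student: zero loss
print(distillation_loss([3.0, 2.0, 1.0], teacher))  # mismatched student: positive loss
```

In practice this term is usually averaged over token positions and combined with the ordinary next-token loss on ground-truth data.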

RESEARCH · arXiv CS.CL · 4/17/2026

Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models

This paper proposes a unified compressed-sensing-guided framework for dynamic LLM execution, addressing the massive parameter counts, memory use, and decoding latency of large language models. It integrates model and prompt compression by using random measurement operators and sparse recovery to estimate task-conditioned and token-adaptive support sets.

Model Compression · LLM optimization · sparse recovery · compressed sensing

RESEARCH · arXiv CS.LG · 5d ago

LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection

LiftQuant is a novel framework for continuous bit-width control in Large Language Models, addressing limitations of integer-based quantization. It employs a "lift-then-project" mechanism to achieve quasi-continuous bit-width tuning for optimal deployment.

Model Compression · neural networks · LLMs · deep learning

RESEARCH · arXiv CS.CL · 27d ago

ReAD: Reinforcement-Guided Capability Distillation for Large Language Models

ReAD proposes a Reinforcement-guided Capability Distillation framework for Large Language Models, aiming to compress LLMs while preserving essential abilities for downstream tasks. It explicitly accounts for the interdependence of capabilities, optimizing token budget usage and mitigating degradation of useful abilities.

Model Compression · Knowledge Distillation · LLMs · reinforcement learning

ARTICLE · DEV.to AI · 4/18/2026

Traditional Quantization vs 1.58-Bit Ternary Models: A Practical Comparison

The article compares traditional quantization methods (like INT4/INT8) used for local LLMs with the emerging 1.58-bit ternary quantization approach found in projects like BitNet b1.58. It highlights the simplicity of ternary models, which use only -1, 0, or +1 for weights, contrasting them with standard post-training quantization techniques.

Model Compression · LLMs · AI optimization · quantization
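
To make the "-1, 0, or +1" idea concrete, here is a minimal sketch of absmean ternarization: scale the weights by their mean absolute value, round, and clip to the range [-1, 1]. This follows the scheme described for BitNet b1.58, but it is shown here on an already-trained toy weight list purely for illustration. The function names are hypothetical, and real ternary models apply this inside training rather than as a one-off post-processing step.

```python
def ternarize(weights, eps=1e-8):
    """Map float weights to {-1, 0, +1} plus one shared scale (absmean)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Approximate reconstruction of the original weights."""
    return [q * scale for q in quantized]

weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]
q, scale = ternarize(weights)
print(q)                     # [1, 0, 1, -1, 0, -1]
print(dequantize(q, scale))  # every nonzero entry becomes +/- scale
```

Because every weight is -1, 0, or +1, a matrix-vector product reduces to additions and subtractions followed by a single multiplication by the scale, which is part of the simplicity the article highlights.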

RESEARCH · arXiv CS.LG · 22d ago

Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels

This study investigates the impact of post-training quantization on the quality of large language models (LLMs), revealing that compression can cause bias to emerge. Quantization to 3 bits caused 6-21% of previously unbiased items to develop new stereotypical behaviors in models like Qwen2.5-7B, Mistral-7B, and Phi-3.5-mini. The effect follows a clear dose-response pattern across precision levels.

Model Compression · LLMs · quantization · model quality

NEWS · DEV.to AI · 16d ago

ModelBest Drops BitCPM-CANN: First 1.58-bit LLM on Ascend 910B

ModelBest has released BitCPM-CANN, the first 1.58-bit ternary LLM trained end-to-end on Ascend 910B NPUs. This model uses 6x less VRAM than BF16 while retaining most capabilities and is available in four open-source sizes.

Model Compression · open-source AI · AI hardware · BitNet