quantization

57 items

RESEARCH · DEV.to AI · 24d ago

Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model

This paper presents an efficient 8-bit quantization approach for Transformer-based neural machine translation models, aiming to reduce memory consumption and latency.

AI models efficiency NLP quantization
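
As a rough illustration of what 8-bit quantization involves, the sketch below maps float weights to signed 8-bit codes with a single per-tensor scale and then maps them back. This is a generic symmetric scheme, not the method from the paper listed above; the function names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to codes in [-127, 127].

    Generic textbook scheme, assumed here for illustration only.
    """
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.0, 0.8]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
print(codes, scale, restored)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the error such a scheme trades for a 4x smaller footprint than 32-bit floats.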

DOC · DEV.to AI · 26d ago

How to Deploy Qwen2.5 32B with vLLM + Quantization on a $12/Month DigitalOcean GPU Droplet: Production-Grade Inference at 1/100th Claude Cost

This guide details how to deploy the Qwen2.5 32B language model with vLLM and quantization on a $12/month DigitalOcean GPU droplet. It claims production-grade inference at about 1/100th the cost of Claude.

deployment quantization Cost Optimization vLLM

ARTICLE · DEV.to AI · 4/18/2026

Traditional Quantization vs 1.58-Bit Ternary Models: A Practical Comparison

The article compares traditional quantization methods (like INT4/INT8) used for local LLMs with the emerging 1.58-bit ternary quantization approach found in projects like BitNet b1.58. It highlights the simplicity of ternary models, which use only -1, 0, or +1 for weights, contrasting them with standard post-training quantization techniques.

Model Compression LLMs AI optimization quantization
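
To make the ternary idea concrete, here is a minimal sketch that maps float weights to -1, 0, or +1 by dividing by the mean absolute value and rounding, in the spirit of the absmean scheme associated with BitNet b1.58. The function name `ternary_quantize` and the `eps` default are assumptions for illustration, and real ternary models are trained with this constraint rather than converted after the fact.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Map floats to {-1, 0, +1} using an absmean scale (illustrative sketch)."""
    # Scale by the mean absolute weight; eps avoids division by zero.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip to the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


weights = [0.9, -0.05, 0.4, -1.2, 0.02]
codes, scale = ternary_quantize(weights)

# Three states carry log2(3) bits of information, hence "1.58-bit".
bits_per_weight = math.log2(3)
print(codes, round(bits_per_weight, 2))
```

Small weights collapse to 0 and large ones saturate at -1 or +1, which is where the contrast with INT4/INT8 grids of 16 or 256 levels comes from.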

ARTICLE · DEV.to AI · 29d ago

When I started running models locally, I thought quantization meant squeezing more into RAM. Turns o

The article advises against defaulting to Q4_K_M for local LLM inference, emphasizing that optimal performance comes from testing quantization levels tailored to specific workflows. It suggests that aggressive quantization like Q3_K_S can significantly cut latency with imperceptible quality loss for many tasks, though context length presents a trade-off.

Optimization LLMs quantization hardware

RESEARCH · arXiv CS.LG · 4/24/2026

FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels

FairyFuse is a new inference system designed for CPU-only platforms, enabling multiplication-free execution of large language models. It uses ternary weights ({-1, 0, +1}) to replace floating-point multiplications with conditional additions and subtractions, significantly reducing memory bandwidth bottlenecks and offering up to 16x weight compression.

inference CPU optimization quantization performance
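
The core trick of replacing multiplications with conditional additions and subtractions can be sketched in a few lines. This is a plain-Python illustration of the idea only, not FairyFuse's fused kernel; the function name `ternary_dot` and the optional per-output `scale` are assumptions.

```python
def ternary_dot(codes, activations, scale=1.0):
    """Dot product with ternary weights using only adds and subtracts.

    codes: weights restricted to -1, 0, or +1.
    scale: hypothetical per-output scale applied once at the end.
    """
    acc = 0.0
    for code, x in zip(codes, activations):
        if code == 1:
            acc += x      # +1 weight: add the activation
        elif code == -1:
            acc -= x      # -1 weight: subtract the activation
        # 0 weight: skip, contributes nothing
    # A single multiply per output, not one per weight.
    return acc * scale


codes = [1, 0, -1, 1]
activations = [2.0, 5.0, 3.0, 0.5]
result = ternary_dot(codes, activations)
print(result)
```

The inner loop never multiplies a weight by an activation, so the per-weight work reduces to a branch and an add or subtract.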

RESEARCH · arXiv CS.LG · 28d ago

Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant

This research analyzes three KV cache quantization schemes (KV, KQV, QKQV) and their impact on inner product variance, especially how QJL on K inflates it, amplified by softmax. Empirical findings highlight KQV's superior performance at a budget of n=4, an unconditional K-V asymmetry where QKQV is consistently worse than KQV in KL divergence, and budget-dependent crossovers for geometric K reconstruction.

machine learning quantization AI statistical inference

RESEARCH · arXiv CS.LG · 5/7/2026

MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning

This research introduces MP-ISMoE, a Mixed-Precision Interactive Side Mixture-of-Experts framework, to enhance parameter-efficient transfer learning by mitigating memory overhead. It employs a Gaussian Noise Perturbed Iterative Quantization (GNP-IQ) scheme for lower-bit weight quantization, freeing up memory to improve side network learning capacity and performance.

model efficiency learning Transfer Learning quantization

RESEARCH · arXiv CS.LG · 5/7/2026

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

This research introduces EdgeRazor, a lightweight framework designed to deploy Large Language Models on resource-constrained devices. It leverages mixed-precision quantization-aware distillation to convert full-precision models into lower-bit formats, overcoming limitations of previous quantization methods.

LLMs deep learning quantization model optimization

RESEARCH · arXiv CS.LG · 20d ago

Theory-optimal Quantization Based on Flatness

This research models the relationship between quantization error and outliers in Large Language Models (LLMs) and introduces a new metric, Flatness, to quantify outlier distribution. Based on this, it derives a theoretical optimal solution and proposes Bidirectional Diagonal Quantization (BDQ) for post-training quantization.

deep learning machine learning quantization AI

RESEARCH · arXiv CS.LG · 27d ago

QuIDE: Mastering the Quantized Intelligence Trade-off via Active Optimization

QuIDE introduces a unified metric, the Intelligence Index I, to evaluate the efficiency of quantized neural networks by collapsing the compression-accuracy-latency trade-off. Experiments across various settings identify task-dependent optimal quantization (4-bit or 8-bit), providing a reproducible evaluation protocol and a fitness function for mixed-precision search.

neural networks Optimization machine learning AI Efficiency

RESEARCH · arXiv CS.LG · 22d ago

Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels

This study investigates how post-training quantization affects the quality of large language models (LLMs), revealing that compression can cause bias to emerge. 3-bit quantization caused 6-21% of previously unbiased items to develop new stereotypical behaviors in models like Qwen2.5-7B, Mistral-7B, and Phi-3.5-mini. The effect follows a clear dose-response pattern across precision levels.

Model Compression LLMs quantization model quality

ARTICLE · DEV.to AI · 5/8/2026

The Mobile Architect: Bridging the AI Gap Without a PC

The author shares their experience coding on a smartphone, realizing that AI development can happen anywhere. The Gemma 4 E2B model is a game-changer, enabling AI to run efficiently on mobile devices with low RAM consumption, democratizing access for students and developers.

mobile development Edge AI Gemma 4 AI on Mobile

DOC · DEV.to AI · 14d ago

How to Deploy Llama 3.2 90B with vLLM + Quantization on a $20/Month DigitalOcean GPU Droplet: Enterprise Reasoning at 1/140th Claude Opus Cost

This guide covers deploying the Llama 3.2 90B model with vLLM and quantization on a $20/month DigitalOcean GPU droplet. Per the title, the setup offers enterprise-grade reasoning at 1/140th the cost of Claude Opus.

AI deployment quantization Cost Optimization DigitalOcean

ARTICLE · OpenAI Blog · 29d ago

What Parameter Golf taught us about AI-assisted research

Parameter Golf brought together over 1,000 participants and 2,000 submissions to explore AI-assisted machine learning research. The event focused on coding agents, quantization, and novel model design under strict constraints.

research machine learning quantization AI

NEWS · ML Mastery · 4/30/2026

Effective KV Compression with TurboQuant

Google recently launched TurboQuant, a novel algorithmic suite and library for applying advanced quantization and compression to large language models (LLMs) and vector search engines. The article presents it as an indispensable element of RAG systems.

LLMs quantization vector search RAG systems

ARTICLE · DEV.to AI · 4/14/2026

Best Open-Source Models for OpenClaw — Run Locally, No API Costs

This article recommends the best open-source AI models for local execution on OpenClaw in April 2026, highlighting Qwen3.5:27b as the best all-rounder, DeepSeek-R1-Distill-32B for coding, and Llama 4 Scout for multimodal tasks. It details VRAM requirements and benchmark performance for each model.

open source models LLMs GPU local inference

NEWS · DEV.to AI · 4/14/2026

Autonomous Sovereign AI Nodes: v10082 Deployment Log

This is a deployment log for Autonomous Sovereign AI Nodes v10082 under the FractalMesh Omega Titan project. It details full edge-quantization on Termux hardware, managed by Samuel James Hiotis.

deployment Edge AI Autonomous systems quantization