model optimization

26 items

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/23/2026

Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]

The user is optimizing a Transformer model for size and inference speed, having plateaued after FP16 conversion and ONNX optimization, with pruning yielding limited gains. They are seeking advice on advanced techniques like low-rank factorization, aggressive quantization (INT8/INT4), knowledge distillation, or hardware-specific optimizations to achieve further real-world improvements.

Pruning · inference · Transformer · quantization
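Of the techniques named in this post, INT8 quantization is the simplest to illustrate. The following is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python, not the poster's pipeline or any ONNX API; the function names `quantize_int8` and `dequantize_int8` and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization of a list of floats (illustrative only)."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


# Hypothetical sample weights, chosen only for illustration.
weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

In this scheme the rounding error of any single weight is at most half the scale, so small weights such as `0.003` can collapse to zero when one large value sets the scale. That is one reason per-channel or mixed-precision schemes are often considered.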

NEWS · ↑ trending · Reddit r/MachineLearning · 4/22/2026

INT3 compression+fused metal kernels [R]

A solo founder developed INT3 model compression and a 2-bit KV cache with custom fused Metal kernels for Mac (M-series). Qwen 7B is available in preview, and further optimizations and GPU support are planned.

Hardware Acceleration · LLMs · quantization · model optimization

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/22/2026

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent

The author demonstrates that pairing the Qwen3.6-35B model with the "little-coder" agent drastically improves its performance on the Polyglot benchmark to 78.7%, making it competitive with top cloud models. This finding suggests that a "harness mismatch" in testing setups might explain performance gaps between local and cloud AI models.

LLMs · coding agents · Benchmarking · Agent systems

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

Qwen3.6-35B-A3B-Uncensored-Wasserstein-GGUF

A user discovered and fixed a significant tensor drift issue in the `ssm_conv1d` layers of quantized Qwen3.6-35B GGUF models, proposing the Wasserstein metric as superior to Kullback-Leibler divergence for detecting numerical instability. The fix, which targets the recurrent state-transition layers responsible for long-context memory, is now available in a shared model.

LLMs · quantization · GGUF · model optimization
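One general property that may motivate the metric choice in this post: a histogram-based Kullback-Leibler divergence only sees how much probability mass overlaps, while the 1-D Wasserstein distance also reflects how far values have moved. The sketch below illustrates that property on synthetic numbers. It is not the author's detection procedure, and the helper names, bin layout, and data are all assumptions made for the example.

```python
import math


def wasserstein_1d(a, b):
    """W1 distance between two equal-sized 1-D samples: mean gap between sorted values."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def histogram(values, lo, hi, bins, eps=1e-9):
    """Normalized histogram with a small epsilon so empty bins do not break the KL sum."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1
    probs = [c / len(values) + eps for c in counts]
    norm = sum(probs)
    return [p / norm for p in probs]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Synthetic "tensor values": a reference and two drifted copies (hypothetical data).
reference = [0.1 * i for i in range(10)]
small_drift = [v + 2.0 for v in reference]
large_drift = [v + 8.0 for v in reference]

ref_hist = histogram(reference, 0.0, 10.0, 10)
kl_small = kl_divergence(ref_hist, histogram(small_drift, 0.0, 10.0, 10))
kl_large = kl_divergence(ref_hist, histogram(large_drift, 0.0, 10.0, 10))
w_small = wasserstein_1d(reference, small_drift)
w_large = wasserstein_1d(reference, large_drift)
```

Both drifted copies fall entirely outside the reference bin, so the two KL values are essentially identical, while the Wasserstein distance grows with the size of the shift.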

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/16/2026

PSA: Qwen3.6 ships with preserve_thinking. Make sure you have it on.

Qwen3.6 ships with a new `preserve_thinking` flag that addresses the KV cache invalidation issue by keeping the model's full reasoning context. The flag is particularly useful in agent scenarios, where it improves decision consistency, token consumption, and KV cache utilization.

large language models · model optimization · Qwen · AI agents

DOC · ↑ trending · Reddit r/MachineLearning · 4/22/2026

Need Info on quality benchmarks to run on DeepSeek V3.2 different quant levels [D]

A user is seeking advice on what quality benchmarks to run to measure the performance degradation when applying runtime quantization to the DeepSeek V3.2 large language model. The goal is to compare the quality loss against the non-quantized version.

Benchmarking · quantization · model optimization · AI evaluation

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/16/2026

ResBM: a new transformer-based architecture for low-bandwidth pipeline-parallel training, achieving 128× activation compression [R]

Macrocosmos has introduced ResBM, a new transformer-based architecture for low-bandwidth pipeline-parallel training. It achieves 128x activation compression without significant loss in convergence compared to uncompressed baselines.

distributed training · machine learning architecture · model optimization · Transformers

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/23/2026

2b or not 2b ? Custom LLM Scheduling Competition [P]

A Kaggle competition has been launched, focusing on optimizing token costs for LLM answers by deciding whether to run a small model or skip a question. The goal is to minimize weighted cost, considering compute, failure, and penalty for skipping a correct answer.

Kaggle · Benchmarking · model optimization · resource management
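The run-or-skip decision described above can be framed as comparing two expected costs. The sketch below is one possible reading, not the competition's actual scoring formula: the cost weights, the probability estimate `p_correct`, and the function names are all hypothetical.

```python
def expected_costs(p_correct, compute_cost, failure_cost, skip_penalty):
    """Expected cost of running the small model vs. skipping the question.

    Assumed model (not the official metric):
      run  = compute cost, plus the failure cost when the answer is wrong
      skip = penalty only when the model would have answered correctly
    """
    run = compute_cost + (1.0 - p_correct) * failure_cost
    skip = p_correct * skip_penalty
    return run, skip


def should_run(p_correct, compute_cost=1.0, failure_cost=2.0, skip_penalty=4.0):
    """Run the model only when its expected cost is lower than skipping."""
    run, skip = expected_costs(p_correct, compute_cost, failure_cost, skip_penalty)
    return run < skip


# With the hypothetical default weights, the break-even point is p_correct = 0.5.
decision_confident = should_run(0.9)   # likely correct: running is cheaper
decision_unlikely = should_run(0.2)    # likely wrong: skipping is cheaper
```

Under these assumed weights, the whole problem reduces to estimating `p_correct` per question, which is where a learned scheduler would come in.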

CASE · ↑ trending · Reddit r/MachineLearning · 4/27/2026

INT8 quantization gives me better accuracy than FP16 ! [D]

A user observed that INT8 quantization gave their deep learning model better inference accuracy than FP16, contrary to expectation, and is asking what could explain the result.

inference · ONNX · deep learning · quantization

RESEARCH · arXiv CS.LG · 4/16/2026

Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

This paper presents a necessary condition for intra-group learning algorithm design in Reinforcement Learning, requiring objectives to maintain gradient exchangeability across token updates to prevent reward-irrelevant drift. It proposes minimal transformations to restore this cancellation structure, which stabilizes training and improves sample efficiency.

reinforcement learning · large language models · gradient dynamics · model optimization

RESEARCH · DEV.to AI · 4/20/2026

O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning

O1-Pruner introduces a length-harmonizing fine-tuning method for O1-like reasoning models. The pruning targets the length of the model's reasoning rather than its weights, aiming to cut inference overhead on reasoning tasks.

Pruning · Reasoning · Fine-tuning · model optimization

RESEARCH · Together AI Blog · 4/15/2026

Parcae: Doing more with fewer parameters using stable looped models

Parcae is a stable looped language model that matches the quality of a Transformer twice its size, using fewer parameters. It introduces the first scaling laws for looping, demonstrating that increasing recurrence is a compute-efficient path to better performance.

language models · deep learning · efficiency · model optimization

RESEARCH · arXiv CS.LG · 4/20/2026

Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit

This research introduces sequential KV compression, a novel two-layer architecture for transformer key-value caches that surpasses the per-vector Shannon limit. It leverages the sequential nature of KV cache tokens, using probabilistic prefix deduplication with language tries and predictive delta coding to achieve more efficient compression.

Transformer Architecture · AI models · LLMs · data compression
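The delta-coding half of this idea can be illustrated in isolation. The sketch below uses the simplest possible predictor, "the next vector equals the previous one", and stores only the integer differences. It does not reproduce the paper's trie-based predictor or its prefix deduplication; the function names and the toy cache values are hypothetical.

```python
def delta_encode(vectors):
    """Keep the first vector, then store each later vector as its difference from the previous one."""
    if not vectors:
        return []
    encoded = [list(vectors[0])]
    for prev, cur in zip(vectors, vectors[1:]):
        encoded.append([c - p for c, p in zip(cur, prev)])
    return encoded


def delta_decode(encoded):
    """Rebuild the original vectors by accumulating the stored differences."""
    if not encoded:
        return []
    decoded = [list(encoded[0])]
    for delta in encoded[1:]:
        decoded.append([p + d for p, d in zip(decoded[-1], delta)])
    return decoded


# Toy, already-quantized cache entries for three consecutive tokens (hypothetical values).
cache = [[100, -40, 7], [101, -41, 7], [103, -41, 6]]
encoded = delta_encode(cache)
roundtrip = delta_decode(encoded)
```

When consecutive entries are similar, the stored differences are small integers, which an entropy coder can represent in fewer bits than the raw values. A better predictor shrinks the residuals further, which is the role the summary attributes to the language tries.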

RESEARCH · arXiv CS.CL · 4/7/2026

SoLA: Leveraging Soft Activation Sparsity and Low-Rank Decomposition for Large Language Model Compression

SoLA is a new training-free compression method for LLMs that uses soft activation sparsity and low-rank decomposition. It identifies the components that are crucial for inference and compresses most of the rest, aiming to reduce the parameter count of large language models efficiently and affordably.

Sparsity · Low-Rank Decomposition · LLM compression · large language models

NEWS · DEV.to AI · 9d ago

Bonsai Image 4B: 1-bit diffusion that runs on an iPhone

PrismML launched Bonsai Image 4B, a family of image generation models that use 1-bit or ternary weights to run high-quality diffusion on local devices such as iPhones. The model shrinks from 7.75 GB to 0.93 GB, about 8.3x compression, while retaining up to 95% of the original quality.

Diffusion Models · Edge AI · image generation · PrismML

RESEARCH · DEV.to AI · 20d ago

AI/ML Research Digest — May 16, 2026

Recent AI/ML research improves model efficiency and inference speed across several applications. Techniques such as knowledge distillation with low-rank adapters, improved on-policy distillation, the Pion optimizer, and prune-then-distill methods reduce computational cost and make advanced models easier to deploy.

deep learning · machine learning · AI Efficiency · video generation

RESEARCH · arXiv CS.CL · 4/27/2026

An End-to-End Ukrainian RAG for Local Deployment. Optimized Hybrid Search and Lightweight Generation

This paper introduces a highly efficient Retrieval-Augmented Generation (RAG) system specifically for Ukrainian document question answering, which achieved 2nd place in the UNLP 2026 Shared Task. It features a custom hybrid search and a specialized Ukrainian language model, compressed for high-quality, verifiable local deployment on resource-constrained hardware.

Ukrainian language · RAG · natural language processing · Local AI

RESEARCH · arXiv CS.LG · 5/7/2026

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

This research introduces EdgeRazor, a lightweight framework designed to deploy Large Language Models on resource-constrained devices. It leverages mixed-precision quantization-aware distillation to convert full-precision models into lower-bit formats, overcoming limitations of previous quantization methods.

LLMs · deep learning · quantization · model optimization

RESEARCH · DEV.to AI · 4/25/2026

PP-LCNet: A Lightweight CPU Convolutional Neural Network

PP-LCNet introduces a lightweight convolutional neural network optimized for efficient performance on CPUs. This architecture focuses on achieving high accuracy while maintaining minimal computational demands, making it suitable for resource-constrained environments.

deep learning · lightweight models · computer vision · Convolutional Neural Networks

RESEARCH · arXiv CS.LG · 27d ago

LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection

Diffusion Language Models (dLLMs) are limited in how much they can decode in parallel because of overly conservative confidence thresholds. This paper introduces LEAP, a training-free, plug-and-play method that detects early-converging tokens, increasing dLLM parallelism and accelerating decoding.

Diffusion Models · Parallel Computing · AI · large language models