Pruning

4 items

ARTICLE↑ trendingReddit r/MachineLearning·4/23/2026

Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]

The user is optimizing a Transformer model for size and inference speed, having plateaued after FP16 conversion and ONNX optimization, with pruning yielding limited gains. They are seeking advice on advanced techniques like low-rank factorization, aggressive quantization (INT8/INT4), knowledge distillation, or hardware-specific optimizations to achieve further real-world improvements.

Pruning inference Transformer quantization

RESEARCHDEV.to AI·4/20/2026

O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning

O1-Pruner introduces a length-harmonizing fine-tuning method aimed at improving reasoning capabilities through model pruning. This technique focuses on optimizing models for specific O1-like reasoning tasks.

Pruning Reasoning Fine-tuning model optimization

RESEARCHarXiv CS.LG·4/8/2026

Prune-Quantize-Distill: An Ordered Pipeline for Efficient Neural Network Compression

Este artigo propõe um pipeline ordenado (poda, quantização INT8 e destilação de conhecimento) para otimizar a compressão de redes neurais, visando a latência de inferência medida em vez de métricas indiretas. A pesquisa revela que a quantização INT8 oferece o principal benefício de tempo de execução, enquanto a poda atua como um pré-condicionador e a destilação de conhecimento recupera a precisão.

Pruning Knowledge Distillation model efficiency Neural Network Compression

RESEARCHarXiv CS.CL·5/1/2026

Exploring the Limits of Pruning: Task-Specific Neurons, Model Collapse, and Recovery in Task-Specific Large Language Models

This study explores the existence of task-specific neurons in large language models, focusing on mathematical reasoning and code generation. It introduces an activation-based selectivity metric for neuron pruning, which consistently outperforms random pruning in reducing computational cost and preserving task accuracy, while preventing performance collapse.

Pruning AI optimization model collapse large language models