
Pruning

4 items

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/23/2026

Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]

The user is optimizing a Transformer model for size and inference speed, having plateaued after FP16 conversion and ONNX optimization, with pruning yielding limited gains. They are seeking advice on advanced techniques like low-rank factorization, aggressive quantization (INT8/INT4), knowledge distillation, or hardware-specific optimizations to achieve further real-world improvements.

50
RESEARCH · arXiv CS.LG · 4/8/2026

Prune-Quantize-Distill: An Ordered Pipeline for Efficient Neural Network Compression

This paper proposes an ordered pipeline (pruning, INT8 quantization, and knowledge distillation) for neural network compression, targeting measured inference latency rather than indirect metrics. The research finds that INT8 quantization delivers the main runtime benefit, while pruning acts as a preconditioner and knowledge distillation recovers accuracy.

28
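
To make the ordering in the summary above concrete, here is a minimal sketch of the first two stages, prune then quantize, on a flat list of weights. It is not the paper's implementation: the function names, the magnitude-based pruning criterion, and the symmetric per-tensor INT8 scheme are all assumptions chosen for illustration, and the distillation stage (which needs a training loop) is left out.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Assumption: plain magnitude pruning; the paper may use another criterion.
    """
    k = int(len(weights) * sparsity)  # number of weights to zero
    if k == 0:
        return list(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])  # indices of the k smallest |w|
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w is approximated by scale * q."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range used here.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to floats to inspect the quantization error."""
    return [scale * v for v in q]


# Hypothetical weights, for illustration only.
weights = [0.02, -1.27, 0.40, -0.05, 0.90, 0.01]
pruned = magnitude_prune(weights, sparsity=0.5)  # stage 1: prune
q, scale = quantize_int8(pruned)                 # stage 2: quantize
restored = dequantize(q, scale)
```

Pruned weights stay exactly zero after symmetric quantization, which is one reason the two steps compose cleanly in this order. Any latency gain still has to be measured on the target runtime.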
RESEARCH · arXiv CS.CL · 5/1/2026

Exploring the Limits of Pruning: Task-Specific Neurons, Model Collapse, and Recovery in Task-Specific Large Language Models

This study explores the existence of task-specific neurons in large language models, focusing on mathematical reasoning and code generation. It introduces an activation-based selectivity metric for neuron pruning, which consistently outperforms random pruning in reducing computational cost and preserving task accuracy, while preventing performance collapse.

27
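
The summary above does not give the selectivity formula, so the following is only a sketch of what an activation-based selectivity score could look like. It assumes selectivity is the mean absolute activation on task inputs minus the mean on general inputs, and that the lowest-scoring neurons are the ones pruned. The function names and the toy activations are hypothetical.

```python
def mean_abs(rows, j):
    """Mean absolute activation of neuron j over a batch of activation rows."""
    return sum(abs(r[j]) for r in rows) / len(rows)


def selectivity(task_acts, general_acts):
    """Per-neuron score: higher means the neuron fires more on task inputs.

    Assumption: a simple difference of means stands in for the paper's metric.
    """
    n = len(task_acts[0])
    return [mean_abs(task_acts, j) - mean_abs(general_acts, j) for j in range(n)]


def prune_mask(scores, keep):
    """Return 1 for the `keep` highest-scoring neurons and 0 for the rest."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    top = set(order[:keep])
    return [1 if j in top else 0 for j in range(len(scores))]


# Toy activations: rows are inputs, columns are neurons (illustrative only).
task_acts = [[2.0, 0.1, 1.5, 0.0], [3.0, 0.3, 1.5, 0.2]]
general_acts = [[0.5, 0.2, 1.0, 1.0], [0.5, 0.2, 1.0, 2.0]]

scores = selectivity(task_acts, general_acts)
mask = prune_mask(scores, keep=2)  # keep the two most task-selective neurons
```

A random-pruning baseline, as mentioned in the summary, would replace `scores` with random values and compare task accuracy at the same `keep` budget.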