Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]
The user is optimizing a Transformer model for size and inference speed, having plateaued after FP16 conversion and ONNX optimization, with pruning yielding limited gains. They are seeking advice on advanced techniques like low-rank factorization, aggressive quantization (INT8/INT4), knowledge distillation, or hardware-specific optimizations to achieve further real-world improvements.
