The scientific journal Modeling, Optimization and Information Technology
Online media
ISSN 2310-6018

Quantization of outlier free quantizable language models

Khan S., Kabir A., Lukmanov R.A.

UDC 004.032.26
DOI: 10.26102/2310-6018/2026.55.4.005

  • Abstract
  • List of references
  • About authors

As deep learning models, including large language models (LLMs), become part of our daily lives, their computational cost keeps growing: heavy models demand substantial processing power both to train and to run inference. This cost can be reduced with compression techniques such as quantization. However, standard quantization of some transformer models suffers from activation outliers that degrade accuracy. In this study, we develop a hybrid approach: the model is trained with clipped softmax in its attention heads to mitigate outliers, and the trained model is then compressed with activation-aware weight-only quantization, which reduces quantization error by scaling the weights before they are quantized. We show that this approach handles outliers better, as indicated by lower kurtosis in quantized models trained with clipped softmax compared to quantized models trained with the vanilla softmax. Overall, the hybrid method not only achieves the best final model performance but does so by suppressing outliers by a factor of 5–7x across key metrics, making the model far more robust to quantization.
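
The abstract combines two published techniques: the clipped softmax of Bondarenko et al. [5], used at training time to keep attention heads from producing activation outliers, and activation-aware weight-only quantization in the spirit of AWQ [13], applied after training, with kurtosis serving as the outlier diagnostic. The sketch below illustrates these ingredients under stated assumptions: it is written in PyTorch, the function names and hyperparameter values (zeta, gamma, alpha, the 8-bit width) are illustrative choices of this summary rather than values taken from the paper, and the quantizer is a simplified per-tensor scheme, not the authors' exact pipeline.

# A minimal PyTorch sketch of the techniques named in the abstract.
# Function names and hyperparameter values are illustrative assumptions.
import torch
import torch.nn.functional as F

def clipped_softmax(logits: torch.Tensor, zeta: float = 1.003,
                    gamma: float = -0.003) -> torch.Tensor:
    # Clipped softmax [5]: stretch softmax outputs to the range [gamma, zeta]
    # and clip back to [0, 1]. Heads can then emit exact zeros without driving
    # pre-softmax logits to the extreme values that create activation outliers.
    probs = F.softmax(logits, dim=-1)
    return torch.clamp((zeta - gamma) * probs + gamma, min=0.0, max=1.0)

def awq_style_weight_quant(weight: torch.Tensor, act_sample: torch.Tensor,
                           alpha: float = 0.5, n_bits: int = 8):
    # Activation-aware weight-only quantization in the spirit of AWQ [13]:
    # input channels that see large activations get their weights scaled up
    # before quantization, preserving the most salient weights; the returned
    # scales must be divided out of the matching activations at inference.
    # weight: [out_features, in_features]; act_sample: [tokens, in_features].
    s = act_sample.abs().mean(dim=0).clamp(min=1e-5).pow(alpha)
    w_scaled = weight * s
    qmax = 2 ** (n_bits - 1) - 1                      # symmetric integer grid
    step = w_scaled.abs().max() / qmax
    w_q = torch.clamp(torch.round(w_scaled / step), -qmax, qmax) * step
    return w_q, s

def excess_kurtosis(x: torch.Tensor) -> float:
    # Excess kurtosis of a tensor; large positive values indicate the
    # heavy-tailed, outlier-prone distributions the method aims to avoid.
    x = x.flatten().float()
    z = (x - x.mean()) / x.std(unbiased=False)
    return (z.pow(4).mean() - 3.0).item()

In such a setup, a transformer would be trained with clipped_softmax substituted for the standard softmax in every attention head; after training, each linear layer's weight matrix and a small batch of calibration activations would be passed to awq_style_weight_quant, and excess_kurtosis of the activations would be compared between the clipped-softmax and vanilla models to gauge how strongly outliers were suppressed.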

1. Li P., Yang J., Islam M.A., Ren Sh. Making AI Less "Thirsty": Uncovering and Addressing the Secret Water Footprint of AI Models. arXiv. URL: https://arxiv.org/abs/2304.03271 [Accessed 18th August 2025].

2. Gholami A., Kim S., Dong Zh., et al. A Survey of Quantization Methods for Efficient Neural Network Inference. arXiv. URL: https://arxiv.org/abs/2103.13630 [Accessed 18th August 2025].

3. Dettmers T., Lewis M., Belkada Y., Zettlemoyer L. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv. URL: https://arxiv.org/abs/2208.07339 [Accessed 18th August 2025].

4. Xiao G., Lin J., Seznec M., et al. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. arXiv. URL: https://arxiv.org/abs/2211.10438 [Accessed 18th August 2025].

5. Bondarenko Y., Nagel M., Blankevoort T. Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing. arXiv. URL: https://arxiv.org/abs/2306.12929 [Accessed 18th August 2025].

6. Khan S.A., Shulepina S., Shulepin D., Lukmanov R.A. Review of Algorithmic Solutions for Deployment of Neural Networks on Lite Devices. Computer Research and Modeling. 2024;16(7):1601–1619. https://doi.org/10.20537/2076-7633-2024-16-7-1601-1619

7. Krishnamoorthi R. Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper. arXiv. URL: https://arxiv.org/abs/1806.08342 [Accessed 24th August 2025].

8. Dumitru R.-G., Yadav V., Maheshwary R., et al. Layer-wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-levels. arXiv. URL: https://arxiv.org/abs/2406.17415 [Accessed 24th August 2025].

9. Dai S., Venkatesan R., Ren H., et al. VS-Quant: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference. arXiv. URL: https://arxiv.org/abs/2102.04503 [Accessed 24th August 2025].

10. Nagel M., van Baalen M., Blankevoort T., Welling M. Data-Free Quantization through Weight Equalization and Bias Correction. arXiv. URL: https://arxiv.org/abs/1906.04721 [Accessed 28th August 2025].

11. Guo M., Dai Z., Vrandečić D., Al-Rfou R. Wiki-40B: Multilingual Language Model Dataset. In: Proceedings of the 12th Language Resources and Evaluation Conference, LREC 2020, 11–16 May 2020, Marseille, France. European Language Resources Association; 2020. P. 2440–2452.

12. Zhu Y., Kiros R., Zemel R., et al. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 07–13 December 2015, Santiago, Chile. IEEE; 2015. P. 19–27. https://doi.org/10.1109/ICCV.2015.11

13. Lin J., Tang J., Tang H., et al. AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration. arXiv. URL: https://arxiv.org/abs/2306.00978 [Accessed 24th August 2025].

14. Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv. URL: http://arxiv.org/abs/1810.04805 [Accessed 24th August 2025].

Khan Sameed Ahmed

Innopolis University

Innopolis, Russian Federation

Kabir A. S. M. Humaun

Email: humaun.kabir@phystech.edu

Moscow Institute of Physics and Technology

Moscow, Russian Federation

Lukmanov Rustam Abubakirovich

Innopolis University

Innopolis, Russian Federation

Keywords: quantization, outlier, perplexity, attention, softmax, kurtosis

Sources of funding: This work was supported by the Academy of Sciences of the Republic of Tatarstan under grant agreement No. 254/2024-PD.

For citation: Khan S., Kabir A., Lukmanov R.A. Quantization of outlier free quantizable language models. Modeling, Optimization and Information Technology. 2026;14(4). URL: https://moitvivt.ru/ru/journal/article?id=2082 DOI: 10.26102/2310-6018/2026.55.4.005.

© Khan S., Kabir A., Lukmanov R.A. This article is published under the terms of the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0).

Received 09.02.2026

Revised 18.03.2026

Accepted 10.04.2026