References

moitvivt

Моделирование, оптимизация и информационные технологии

Modeling, Optimization and Information Technology

2310-6018

Publisher

10.26102/2310-6018/2026.55.4.005

2082

Квантование языковых моделей без выбросов

Quantization of outlier-free quantizable language models

Хан

Самид Ахмед

Khan

Sameed Ahmed

sameedkhandurrani@gmail.com aff-1

Кабир

А. С. М. Хумаюн

Kabir

A. S. M. Humaun

humaun.kabir@phystech.edu aff-2

Лукманов

Рустам Абубакирович

Lukmanov

Rustam Abubakirovich

r.lukmanov@innopolis.ru aff-3

Университет Иннополис Innopolis University

Московский физико-технический институт Moscow Institute of Physics and Technology

Университет Иннополис Innopolis University

01 01 2026

1 1

2026

This work is licensed under a Creative Commons Attribution 4.0 International License

По мере того, как модели глубокого обучения, включая большие языковые модели (LLM), становятся частью нашей повседневной жизни, они требуют все больших вычислительных ресурсов. Тяжелые модели нуждаются в значительной вычислительной мощности как для обучения, так и для выполнения выводов. Однако эту нагрузку можно снизить с помощью методов сжатия, таких как квантование. Стандартное квантование некоторых моделей трансформеров сопряжено с риском появления выбросов, что приводит к неточным результатам. В данном исследовании разрабатывается гибридная модель, которая включает использование усеченного софтмакса в модулях внимания модели во время обучения для смягчения влияния выбросов, а затем применение квантования только весов с учетом активаций на обученной модели. Это помогает снизить ошибку квантования за счет масштабирования весов перед квантованием. Показано, что предлагаемый подход позволяет лучше справляться с выбросами, о чем свидетельствует уменьшение куртоза у моделей с квантованием, обученных с усеченным софтмаксом, по сравнению с моделями, обученными стандартным способом. В целом, гибридная методика не только обеспечивает наилучшую итоговую производительность модели (наименьшую перплексию), но и эффективно подавляет выбросы в 5–7 раз по ключевым метрикам, делая модель значительно более устойчивой к процессу квантования.

As deep learning models, including large language models (LLMs), become part of daily life, their computational cost keeps growing. Large models need substantial processing power for both training and inference. Compression techniques such as quantization can reduce this cost. However, standard quantization of some transformer models is hampered by outliers, which lead to inaccurate results. In this study, we develop a hybrid approach: clipped softmax is used in the attention heads during training to mitigate outliers, and activation-aware weight-only quantization is then applied to the trained model, which reduces quantization error by scaling the weights before quantization. We show that this approach handles outliers better, as indicated by the lower kurtosis of quantized models trained with clipped softmax compared with quantized models trained in the vanilla way. Overall, the hybrid method not only achieves the best final model performance (lowest perplexity) but also suppresses outliers by a factor of 5–7 across key metrics, making the model far more robust to quantization.
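To make the two ingredients named in the abstract concrete, the following minimal Python sketch shows a clipped softmax in the form given by Bondarenko et al. (listed in the references), clip((ζ − γ)·softmax(x) + γ, 0, 1), together with a plain kurtosis estimate of the kind that could be used to gauge outliers. The function names and the values ζ = 1.0 and γ = −0.03 are illustrative assumptions, not the settings used in this study.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_softmax(scores, zeta=1.0, gamma=-0.03):
    """Stretch softmax output to [gamma, zeta], then clip it back to [0, 1].

    Requires zeta >= 1 and gamma <= 0. The defaults are illustrative
    assumptions, not values taken from the study.
    """
    return [
        min(1.0, max(0.0, (zeta - gamma) * p + gamma))
        for p in softmax(scores)
    ]


def kurtosis(values):
    """Fourth standardized moment (non-excess), about 3 for a Gaussian."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    return fourth / var ** 2


# One dominant attention score and three weak ones (hypothetical values).
probs = clipped_softmax([4.0, 0.0, 0.0, -6.0])

# A balanced sample versus a sample with a single large outlier.
balanced = kurtosis([-1.0, 1.0, -1.0, 1.0])
heavy_tailed = kurtosis([0.0] * 9 + [10.0])
```

Because the stretched output is clipped, small attention probabilities can reach exactly zero, so a head can effectively do nothing without its inputs being pushed to extreme values. The heavy-tailed sample yields a noticeably higher kurtosis than the balanced one, which is the sense in which kurtosis serves as an outlier indicator.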

квантование, выброс, перплексия, внимание, софтмакс, куртозис

quantization, outlier, perplexity, attention, softmax, kurtosis

Данная работа была поддержана Академией наук Республики Татарстан в рамках гранта № 254/2024-PD.

This work was supported by the Academy of Sciences of the Republic of Tatarstan under grant agreement No. 254/2024-PD.

References

Li P., Yang J., Islam M.A., Ren Sh. Making AI Less "Thirsty": Uncovering and Addressing the Secret Water Footprint of AI Models. arXiv. URL: https://arxiv.org/abs/2304.03271 [Accessed 18th August 2025].

Gholami A., Kim S., Dong Zh., et al. A Survey of Quantization Methods for Efficient Neural Network Inference. arXiv. URL: https://arxiv.org/abs/2103.13630 [Accessed 18th August 2025].

Dettmers T., Lewis M., Belkada Y., Zettlemoyer L. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv. URL: https://arxiv.org/abs/2208.07339 [Accessed 18th August 2025].

Xiao G., Lin J., Seznec M., et al. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. arXiv. URL: https://arxiv.org/abs/2211.10438 [Accessed 18th August 2025].

Bondarenko Y., Nagel M., Blankevoort T. Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing. arXiv. URL: https://arxiv.org/abs/2306.12929 [Accessed 18th August 2025].

Khan S.A., Shulepina S., Shulepin D., Lukmanov R.A. Review of Algorithmic Solutions for Deployment of Neural Networks on Lite Devices. Computer Research and Modeling. 2024;16(7):1601–1619. (In English). https://doi.org/10.20537/2076-7633-2024-16-7-1601-1619

Krishnamoorthi R. Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper. arXiv. URL: https://arxiv.org/abs/1806.08342 [Accessed 24th August 2025].

Dumitru R.-G., Yadav V., Maheshwary R., et al. Layer-wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-levels. arXiv. URL: https://arxiv.org/abs/2406.17415 [Accessed 24th August 2025].

Dai S., Venkatesan R., Ren H., et al. VS-Quant: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference. arXiv. URL: https://arxiv.org/abs/2102.04503 [Accessed 24th August 2025].

Nagel M., van Baalen M., Blankevoort T., Welling M. Data-Free Quantization through Weight Equalization and Bias Correction. arXiv. URL: https://arxiv.org/abs/1906.04721 [Accessed 28th August 2025].

Guo M., Dai Z., Vrandečić D., Al-Rfou R. Wiki-40B: Multilingual Language Model Dataset. In: Proceedings of the 12th Language Resources and Evaluation Conference, LREC 2020, 11–16 May 2020, Marseille, France. European Language Resources Association; 2020. P. 2440–2452.

Zhu Y., Kiros R., Zemel R., et al. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 07–13 December 2015, Santiago, Chile. IEEE; 2015. P. 19–27. https://doi.org/10.1109/ICCV.2015.11

Lin J., Tang J., Tang H., et al. AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration. arXiv. URL: https://arxiv.org/abs/2306.00978 [Accessed 24th August 2025].

Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv. URL: http://arxiv.org/abs/1810.04805 [Accessed 24th August 2025].

The authors declare no conflicts of interest.