Using simultaneous multithreading in high-performance numerical algorithms

Buevich E.A.

UDC 519.6
DOI: 10.26102/2310-6018/2024.45.2.041

Abstract

Simultaneous multithreading is generally considered to be of little use in compute-intensive programs, in particular in matrix multiplication, one of the core operations of machine learning. The purpose of this work is to determine the limits of applicability of this type of multithreading to high-performance numerical code, using block matrix multiplication as an example. The paper identifies a number of characteristics of matrix multiplication code and of processor architecture that affect the efficiency of simultaneous multithreading. A method is proposed for detecting structural limitations of the processor when it executes more than one thread and for estimating them quantitatively. The influence of the chosen synchronization primitive, and its specifics with respect to simultaneous multithreading, is examined. The existing algorithm for partitioning matrices into blocks is reviewed, and changes to the block sizes and loop parameters are proposed so that two threads make better use of the execution units of a processor core. A model is developed for estimating the performance of two threads executing identical code on one physical core. A criterion is formulated for determining whether compute-intensive code can be optimized with this type of multithreading. It is shown that dividing computations between logical threads that share an L1 cache is beneficial on at least one widely used processor architecture.

Buevich Evgeniy Andreevich

Moscow State Technological University "STANKIN"

Moscow, Russian Federation

Keywords: simultaneous multithreading, matrix multiplication, compute-intensive code, microkernel, BLAS, BLIS, synchronization, cache hierarchy, spinlock

For citation: Buevich E.A. Using simultaneous multithreading in high-performance numerical algorithms. Modeling, Optimization and Information Technology. 2024;12(2). URL: https://moitvivt.ru/ru/journal/pdf?id=1588 DOI: 10.26102/2310-6018/2024.45.2.041 (In Russ).

Received 27.05.2024

Revised 14.06.2024

Accepted 20.06.2024

Published 30.06.2024