Keywords: multimodal medical data, system analysis, distributed data processing, Apache Spark, intelligent systems, diagnostics, hybrid architecture, big data
An approach to building a distributed analytical platform for multimodal medical data in clinical diagnostic tasks
UDC 004.9:616(043)
DOI: 10.26102/2310-6018/2025.51.4.069
This paper presents an approach to building a distributed analytical platform for deep processing of multimodal medical data, aimed at clinical diagnostics and medical decision support. The motivation is the growing volume of heterogeneous data (DICOM images, electronic medical records, laboratory results), which is being centralized in EGISZ-class systems while specialized tools for integrated analysis remain scarce in real clinical practice. The core of the platform is a hybrid processing model that combines a distributed Apache Spark pipeline with a modular data preparation system and a multimodal transformer for cross-modal analysis. The pipeline implements dedicated procedures for text tokenization and normalization (Spark NLP), metadata extraction, and conversion of DICOM images to numeric representations. For high-performance computing, a scalable Apache Spark core can hand prepared samples to a GPU-oriented service through Petastorm and PyTorch. The multimodal transformer fuses image embeddings (ViT), embeddings of clinical text descriptions (BioClinicalBERT), and tabular features, using positional encoding and several self-attention layers to form an aggregated representation of a treatment episode. A software prototype of the platform has been developed using Docker. Experiments on a synthetic multimodal dataset showed that the platform can identify statistically significant and clinically relevant patterns (for example, the association of pneumonia with COPD) while maintaining high performance.
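As an illustration of what a "statistically significant" association such as pneumonia with COPD can mean in practice, the sketch below computes an odds ratio and a Pearson chi-square test (without continuity correction) on a 2x2 table of treatment episodes. The abstract does not state which statistical test the authors used, so the choice of test, the function name `two_by_two_association`, and the episode counts are all assumptions made purely for illustration; they are not results from the paper.

```python
import math


def two_by_two_association(a: int, b: int, c: int, d: int) -> dict:
    """Odds ratio and Pearson chi-square test for a 2x2 contingency table.

    Hypothetical helper, not taken from the paper. Cell layout:
        a: pneumonia and COPD       b: pneumonia, no COPD
        c: no pneumonia, COPD       d: neither diagnosis
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column needs at least one episode")

    # Pearson chi-square statistic for a 2x2 table (no continuity correction).
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

    # With 1 degree of freedom, chi2 is the square of a standard normal,
    # so the upper-tail p-value equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))

    # The odds ratio is undefined when b * c == 0; report infinity in that case.
    odds_ratio = (a * d) / (b * c) if b * c else math.inf

    return {"odds_ratio": odds_ratio, "chi2": chi2, "p_value": p_value}


if __name__ == "__main__":
    # Illustrative counts only; they are NOT data from the paper.
    result = two_by_two_association(a=30, b=70, c=20, d=180)
    print(result)
```

In a Spark-based pipeline, the four counts would typically come from a grouped aggregation over diagnosis flags per treatment episode, with the test itself run on the driver, since only four integers are involved.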
For citation: Pozharsky R.V., Petrova E.S. An approach to building a distributed analytical platform for multimodal medical data in clinical diagnostic tasks. Modeling, Optimization and Information Technology. 2025;13(4). URL: https://moitvivt.ru/ru/journal/pdf?id=2141 DOI: 10.26102/2310-6018/2025.51.4.069 (In Russ.).
Received 27.11.2025
Revised 19.12.2025
Accepted 25.12.2025
Published 31.12.2025