
A comparative study of deep learning architectures for interpretable diagnosis of retinal diseases

Miroshnichenko V.V., Kashirina I.L.

UDC 004.932.2
DOI: 10.26102/2310-6018/2026.53.2.016

Abstract

Interpretability of deep learning decisions remains a critical requirement for their application in medical diagnostics. This study presents a comparative analysis of three modern neural network architectures, Vision Transformer (ViT), Swin Transformer, and ConvNeXt, for the multiclass classification of retinal diseases from optical coherence tomography (OCT) images. The experiments were conducted on the open OCTDL dataset, which contains 2,064 images across seven diagnostic categories with pronounced class imbalance. To compensate for this imbalance, a class-weighted loss function was employed. All three models achieved validation accuracy above 0.91, with ConvNeXt demonstrating the best performance (0.945) and the most balanced sensitivity and specificity, particularly for rare pathologies. Model interpretability was evaluated using Grad-CAM, attention weight visualization, and the model-agnostic LIME method. The analysis showed that ConvNeXt combined with Grad-CAM provides the most reliable localization of clinically significant features, whereas ViT attention maps and Swin Transformer activation maps were often blurred or focused on non-informative regions. The results confirm ConvNeXt as the most promising architecture for clinical deployment in ophthalmological diagnostics, owing to its combination of high accuracy, interpretability, and moderate computational requirements.
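To make the two methodological points of the abstract concrete, the sketch below (not taken from the paper) shows how a class-weighted cross-entropy loss and a Grad-CAM heatmap for a ConvNeXt classifier could be set up in PyTorch. The model name convnext_tiny, the per-class image counts, and the choice of model.stages[-1] as the Grad-CAM target layer are illustrative assumptions and are not stated in the article; the timm and pytorch-grad-cam packages are likewise only one possible tooling choice.

```python
# Sketch, assuming PyTorch, timm, and pytorch-grad-cam; not the authors' code.
import torch
import timm
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

NUM_CLASSES = 7  # seven diagnostic categories in OCTDL (per the abstract)

# Hypothetical per-class image counts; replace with the real OCTDL distribution.
class_counts = torch.tensor([800., 300., 250., 200., 200., 200., 114.])

# Inverse-frequency weights compensate for class imbalance in the loss;
# this criterion would be passed to the (omitted) training loop.
weights = class_counts.sum() / (NUM_CLASSES * class_counts)
criterion = torch.nn.CrossEntropyLoss(weight=weights)

# Pretrained ConvNeXt backbone with a new 7-class head (model name is an assumption).
model = timm.create_model("convnext_tiny", pretrained=True, num_classes=NUM_CLASSES)
model.eval()

# Grad-CAM over the final ConvNeXt stage; timm exposes the convolutional
# stages as model.stages, so the last stage is a reasonable target layer.
cam = GradCAM(model=model, target_layers=[model.stages[-1]])

# Dummy preprocessed OCT image batch (1 x 3 x 224 x 224); real code would load
# and normalize an OCTDL scan here.
input_tensor = torch.randn(1, 3, 224, 224)
predicted_class = model(input_tensor).argmax(dim=1).item()

# Grayscale heatmap for the predicted class, shape (1, 224, 224), values in [0, 1].
heatmap = cam(input_tensor=input_tensor,
              targets=[ClassifierOutputTarget(predicted_class)])
print(heatmap.shape, heatmap.min(), heatmap.max())
```

In this setup the heatmap can be overlaid on the original OCT scan to check whether the model attends to clinically meaningful retinal structures, which is the kind of qualitative comparison the abstract describes for ConvNeXt, ViT, and Swin Transformer.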

List of references

1. Kurakina V.M., Vitushkina E.V. Optical Coherence Tomography. Clinical Gerontology. 2010;16(9-10):44. (In Russ.).

2. Kermany D.S., Goldbaum M., Cai W., et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell. 2018;172(5):1122–1131. https://doi.org/10.1016/j.cell.2018.02.010

3. Naim K., Darouichi A. Deep Learning-Based Classification of Retinal Pathologies. Statistics, Optimization and Information Computing. 2025;15(2):1226–1235. https://doi.org/10.19139/soic-2310-5070-2767

4. He J., Wang J., Han Z., Ma J., Wang Ch., Qi M. An interpretable transformer network for the retinal disease classification using optical coherence tomography. Scientific Reports. 2023;13. https://doi.org/10.1038/s41598-023-30853-z

5. Kulyabin M., Zhdanov A., Nikiforova A., et al. OCTDL: Optical Coherence Tomography Dataset for Image-Based Deep Learning Methods. Scientific Data. 2024;11. https://doi.org/10.1038/s41597-024-03182-7

6. Dosovitskiy A., Beyer L., Kolesnikov A., et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: 9th International Conference on Learning Representations, ICLR 2021, 03–07 May 2021, Virtual Event, Austria. 2021. https://doi.org/10.48550/arXiv.2010.11929

7. Liu Z., Lin Y., Cao Y., et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 10–17 October 2021, Montreal, QC, Canada. IEEE; 2021. P. 9992–10002. https://doi.org/10.1109/ICCV48922.2021.00986

8. Liu Zh., Mao H., Wu Ch.-Y., et al. A ConvNet for the 2020s. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18–24 June 2022, New Orleans, LA, USA. IEEE; 2022. P. 11966–11976. https://doi.org/10.1109/CVPR52688.2022.01167

9. Yengec-Tasdemir S.B., Akay E., Dogan S., Yilmaz B. Classification of Colorectal Polyps from Histopathological Images using Ensemble of ConvNeXt Variants. [Preprint]. Research Square. URL: https://doi.org/10.21203/rs.3.rs-1791422/v1 [Accessed 12th January 2026].

10. Selvaraju R.R., Cogswell M., Das A., Vedantam R., Parikh D., Batra D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In: 2017 IEEE International Conference on Computer Vision (ICCV), 22–29 October 2017, Venice, Italy. IEEE; 2017. P. 618–626. https://doi.org/10.1109/ICCV.2017.74

11. Ribeiro M.T., Singh S., Guestrin C. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In: KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 13–17 August 2016, San Francisco, CA, USA. New York: Association for Computing Machinery; 2016. P. 1135–1144. https://doi.org/10.1145/2939672.2939778

12. Cheremiskin A.V., Kashirina I.L. Segmentation of Multiphase CT Images Using an Ensemble of ResUNet Models. Proceedings of Voronezh State University. Series: Systems Analysis and Information Technologies. 2025;(3):140–152. (In Russ.). https://doi.org/10.17308/sait/1995-5499/2025/3/140-152

About the authors

Miroshnichenko Viktor Vyacheslavovich

MIREA – Russian Technological University

Moscow, Russian Federation

Kashirina Irina Leonidovna
Doctor of Engineering Sciences, Professor


MIREA – Russian Technological University

Moscow, Russian Federation

Keywords: deep learning, Vision Transformer, Swin Transformer, ConvNeXt, retinal diseases, Grad-CAM

For citation: Miroshnichenko V.V., Kashirina I.L. A comparative study of deep learning architectures for interpretable diagnosis of retinal diseases. Modeling, Optimization and Information Technology. 2026;14(2). URL: https://moitvivt.ru/ru/journal/pdf?id=2195 DOI: 10.26102/2310-6018/2026.53.2.016 (In Russ.).



Received 31.01.2026

Revised 22.02.2026

Accepted 26.02.2026

Published 28.02.2026