The scientific journal Modeling, Optimization and Information Technology
Online media
ISSN 2310-6018

Clustering algorithms for unstructured texts and their implementation in software systems

Kondakov V.S., Kuznetsova A.V.

UDC 004.912
DOI: 10.26102/2310-6018/2025.50.3.010

The relevance of this study is driven by the rapid growth of unstructured textual data in the digital environment and the pressing need for its systematic analysis. The lack of universal and easily reproducible methods for grouping textual information complicates interpretation and limits practical application across various domains, including healthcare, education, marketing, and the corporate sector. In response to this challenge, the present article aims to identify key algorithmic approaches to clustering unstructured texts and to analyze software systems implementing these methods. The primary research strategy is based on a comparative and analytical approach that enables the generalization and classification of contemporary machine learning algorithms applied to text data processing. The study reviews both traditional clustering techniques and advanced architectures incorporating unsupervised learning, numerical vector representations, and neural network models. Software tools are examined with a focus on their levels of accuracy, interpretability, and adaptability. As a result, the study systematizes criteria for selecting methods according to specific tasks, highlights limitations of existing approaches, and outlines promising directions for further development. The findings are intended to support professionals engaged in designing and deploying software solutions for the automatic processing and analysis of textual information.
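As an illustration of the "traditional clustering techniques" and "numerical vector representations" mentioned above, the following is a minimal sketch, not the authors' method. The abstract does not name specific algorithms, so the choice of TF-IDF weighting with a smoothed IDF, a spherical (cosine-based) k-means variant, the farthest-first seeding, and the four toy documents are all assumptions made for this example. It uses only the Python standard library.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn raw strings into L2-normalised sparse TF-IDF vectors (term -> weight)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch): ln((1 + N) / (1 + df)) + 1.
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Sparse dot product; equals cosine similarity because inputs are unit vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def mean_vector(vectors):
    """Sum the member vectors and rescale the result to unit length."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def kmeans(vectors, k, n_iter=20):
    """Spherical k-means with deterministic farthest-first seeding."""
    # Seed with document 0, then add the document least similar to the
    # centroids chosen so far until k centroids exist.
    centroids = [vectors[0]]
    while len(centroids) < k:
        far = min(range(len(vectors)),
                  key=lambda i: max(cosine(vectors[i], c) for c in centroids))
        centroids.append(vectors[far])
    labels = None
    for _ in range(n_iter):
        # Assignment step: each document goes to its most similar centroid.
        new_labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels


# Hypothetical toy corpus: two healthcare texts and two education texts.
docs = [
    "patient diagnosis hospital treatment",
    "hospital patient treatment clinic",
    "student course exam lecture",
    "lecture student exam homework",
]
labels = kmeans(tfidf_vectors(docs), k=2)
print(labels)  # → [0, 0, 1, 1]
```

The toy corpus separates cleanly because its two topics share no vocabulary. Real unstructured text would typically need tokenisation, stop-word handling, and a principled choice of k, and the neural embedding models the abstract mentions replace the TF-IDF step rather than the clustering step.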

Kondakov Vyacheslav Sergeevich

Platov South-Russian State Polytechnic University (NPI)

Novocherkassk, Russian Federation

Kuznetsova Alla Vitalievna
Candidate of Engineering Sciences, Docent

Platov South-Russian State Polytechnic University (NPI)

Novocherkassk, Russian Federation

Keywords: text clustering, unstructured data, topic modeling, machine learning, vector representations, unsupervised algorithms, software frameworks, text mining

For citation: Kondakov V.S., Kuznetsova A.V. Clustering algorithms for unstructured texts and their implementation in software systems. Modeling, Optimization and Information Technology. 2025;13(3). (In Russ.). URL: https://moitvivt.ru/ru/journal/pdf?id=1970 DOI: 10.26102/2310-6018/2025.50.3.010

Received 25.05.2025

Revised 23.06.2025

Accepted 03.07.2025