References

moitvivt

Modeling, Optimization and Information Technology

2310-6018

Publisher

10.26102/2310-6018/2025.50.3.010

1970

Clustering algorithms for unstructured texts and their implementation in software systems

0009-0006-4058-9835

Kondakov

Vyacheslav Sergeevich

wspc777@yandex.ru aff-1

0000-0001-5028-0053

Kuznetsova

Alla Vitalievna

alvitkuz@yandex.ru aff-2

Platov South-Russian State Polytechnic University (NPI)

01 01 2026

1 1

2026

This work is licensed under a Creative Commons Attribution 4.0 International License

The relevance of this study is driven by the rapid growth of unstructured textual data in the digital environment and the pressing need for its systematic analysis. The lack of universal and easily reproducible methods for grouping textual information complicates interpretation and limits practical application across various domains, including healthcare, education, marketing, and the corporate sector. In response to this challenge, the present article aims to identify key algorithmic approaches to clustering unstructured texts and to analyze software systems implementing these methods. The primary research strategy is based on a comparative and analytical approach that enables the generalization and classification of contemporary machine learning algorithms applied to text data processing. The study reviews both traditional clustering techniques and advanced architectures incorporating unsupervised learning, numerical vector representations, and neural network models. Software tools are examined with a focus on their levels of accuracy, interpretability, and adaptability. As a result, the study systematizes criteria for selecting methods according to specific tasks, highlights limitations of existing approaches, and outlines promising directions for further development. The findings are intended to support professionals engaged in designing and deploying software solutions for the automatic processing and analysis of textual information.

text clustering, unstructured data, topic modeling, machine learning, vector representations, unsupervised algorithms, software frameworks, text mining

The study was performed without external funding.

References

Arnarsson I.Ö., Frost O., Gustavsson E., Jirstrand M., Malmqvist J. Natural Language Processing Methods for Knowledge Management–Applying Document Clustering for Fast Search and Grouping of Engineering Documents. Concurrent Engineering: Research and Applications. 2021;29(2):142–152. https://doi.org/10.1177/1063293X20982973

Voskergian D., Jayousi R., Yousef M. Topic Selection for Text Classification Using Ensemble Topic Modeling with Grouping, Scoring, and Modeling Approach. Scientific Reports. 2024;14. https://doi.org/10.1038/s41598-024-74022-2

Kovtun D.B. Study of Intra-Agency Interaction Among Government Bodies of the Russian Federation Based on Strategic Planning Documents Using Text Mining Technology. Moscow Economic Journal. 2021;(2). (In Russ.). https://doi.org/10.24412/2413-046X-2021-10119

Shi H., Sakai T. Self-Supervised and Few-Shot Contrastive Learning Frameworks for Text Clustering. IEEE Access. 2023;11:84134–84143. https://doi.org/10.1109/ACCESS.2023.3302913

Tulli S.K.C. Enhancing Software Architecture Recovery: A Fuzzy Clustering Approach. International Journal of Modern Computing. 2024;7(1):141–153.

Khodeir N., Elghannam F. Efficient Topic Identification for Urgent MOOC Forum Posts Using BERTopic and Traditional Topic Modeling Techniques. Education and Information Technologies. 2025;30:5501–5527. https://doi.org/10.1007/s10639-024-13003-4

Grootendorst M. BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure. arXiv. URL: https://arxiv.org/abs/2203.05794 [Accessed 24th May 2025].

Cozzolino I., Ferraro M.B. Document Clustering. WIREs Computational Statistics. 2022;14(6). https://doi.org/10.1002/wics.1588

Kavitha D., Anandha Mala G.S., Padmavathi B., Varshni S.V. Text Mining: Clustering Using BERT and Probabilistic Topic Modeling. Social Informatics Journal. 2023;2(2):1–13. https://doi.org/10.58898/sij.v2i2.01-13

Subakti A., Murfi H., Hariadi N. The Performance of BERT as Data Representation of Text Clustering. Journal of Big Data. 2022;9. https://doi.org/10.1186/s40537-022-00564-9

Ahmed M.H., Tiun S., Omar N., Sani N.S. Short Text Clustering Algorithms, Application and Challenges: A Survey. Applied Sciences. 2023;13(1). https://doi.org/10.3390/app13010342

Maslova M.A. An Automated Approach to Sentence Selection for Test Item Generation. Computational Nanotechnology. 2024;11(2):29–34. (In Russ.). https://doi.org/10.33693/2313-223X-2024-11-2-29-34

Probierz B., Kozak J., Hrabia A. Clustering of Scientific Articles Using Natural Language Processing. Procedia Computer Science. 2022;207:3449–3458. https://doi.org/10.1016/j.procs.2022.09.403

Muennighoff N., Tazi N., Magne L., Reimers N. MTEB: Massive Text Embedding Benchmark. arXiv. URL: https://arxiv.org/abs/2210.07316 [Accessed 24th May 2025].

Yan H., Gui L., He Yu. Hierarchical Interpretation of Neural Text Classification. Computational Linguistics. 2022;48(4):987–1020. https://doi.org/10.1162/coli_a_00459

Ali Bukar U., Sayeed M.S., Razak S.F.A., Yogarayan S., Amodu O.A., Mahmood R.A.R. A Method for Analyzing Text Using VOSviewer. MethodsX. 2023;11. https://doi.org/10.1016/j.mex.2023.102339

Anferova M.S., Belevtsev A.M. Development of Algorithms for an Intelligent Information Search and Monitoring Service. Izvestiya SFedU. Engineering Sciences. 2021;(3):6–17. (In Russ.). https://doi.org/10.18522/2311-3103-2021-3-6-17

Zabbarov Z.R., Volkov A.K. A Method for Identifying Relevant Topics of Pilot Simulator Training Based on Clustering of Flight Safety Reports. Civil Aviation High Technologies. 2024;27(4):34–49. (In Russ.). https://doi.org/10.26467/2079-0619-2024-27-4-34-49

Gubanov A.R., Danilov A.A., Isaev Yu.N., Gubanova G.F. Problems of Extracting Semi-Structured Textual Information Using Text Mining Technology (Based on the Russian and Chuvash Languages). Philology. Theory & Practice. 2024;17(9):3085–3090. (In Russ.). https://doi.org/10.30853/phil20240437

Khan W., Kumar T., Zhang Ch., Raj K., Roy A.M., Luo B. SQL and NoSQL Database Software Architecture Performance Analysis and Assessments–A Systematic Literature Review. Big Data and Cognitive Computing. 2023;7(2). https://doi.org/10.3390/bdcc7020097

Mehta V., Bawa S., Singh J. WEClustering: Word Embeddings Based Text Clustering Technique for Large Datasets. Complex & Intelligent Systems. 2021;7(6):3211–3224. https://doi.org/10.1007/s40747-021-00512-9

Zelenkov Yu.A., Anisichkina E.A. Dynamics of Research in Data Mining: A Thematic Analysis of Publications over 20 Years. Business Informatics. 2021;15(1):30–46. (In Eng.). https://doi.org/10.17323/2587-814X.2021.1.30.46

Park Ju.Yo., Mistur E., Kim D., Mo Yu., Hoefer R. Toward Human-Centric Urban Infrastructure: Text Mining for Social Media Data to Identify the Public Perception of COVID-19 Policy in Transportation Hubs. Sustainable Cities and Society. 2022;76. https://doi.org/10.1016/j.scs.2021.103524

Rashid Ju., Kim Ju., Hussain A., Naseem U., Juneja S. A Novel Multiple Kernel Fuzzy Topic Modeling Technique for Biomedical Data. BMC Bioinformatics. 2022;23. https://doi.org/10.1186/s12859-022-04780-1

Goh K.H., Wang L., Yeow A.Yo.K., et al. Artificial Intelligence in Sepsis Early Prediction and Diagnosis Using Unstructured Data in Healthcare. Nature Communications. 2021;12. https://doi.org/10.1038/s41467-021-20910-4

Melton Ch.A., Olusanya O.A., Ammar N., Shaban-Nejad A. Public Sentiment Analysis and Topic Modeling Regarding COVID-19 Vaccines on the Reddit Social Media Platform: A Call to Action for Strengthening Vaccine Confidence. Journal of Infection and Public Health. 2021;14(10):1505–1512. https://doi.org/10.1016/j.jiph.2021.08.010

Alzate M., Arce-Urriza M., Cebollada J. Mining the Text of Online Consumer Reviews to Analyze Brand Image and Brand Positioning. Journal of Retailing and Consumer Services. 2022;67. https://doi.org/10.1016/j.jretconser.2022.102989

Fayou S., Ngo H.Ch., Sek Yo.W., Meng Z. Clustering Swap Prediction for Image-Text Pre-Training. Scientific Reports. 2024;14. https://doi.org/10.1038/s41598-024-60832-x

The authors declare no conflicts of interest.