Word processing and preparation of vectorization models for a software package for the classification of scientific texts

Gusev P.Y.

UDC 004.622
DOI: 10.26102/2310-6018/2021.32.1.010

Abstract

Assigning a scientific specialty to a text is a complex task that usually involves a team of specialists in the relevant field. One of the most common situations in which the task arises is determining the scientific specialty of a dissertation at its defense. Existing scientific texts grouped by specialty can be used to solve this problem, and the most representative set of texts for a given specialty is a set of abstracts. Before an intelligent system for classifying scientific specialties can be built, the abstract texts must be processed and vectorized so that models can be trained on them. Different text-processing methods affect the final result differently. This paper compares several methods of preparing texts, paying special attention to whether the methods can be applied to data sets of different sizes. Studying the preparation methods on a small data set and then scaling the same methods to a large data set significantly reduces the computer time spent on working with the texts. The research established the most effective combination of methods for preparing text data. The prepared texts can then be vectorized in different ways; the paper considers vectorization with the TF-IDF method. To obtain the best results from the machine learning models, experiments were carried out to select the optimal hyperparameters of the vectorizer, and the influence of changes in these hyperparameters on the final result of the machine learning model was evaluated.
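As a rough illustration of the two stages named in the abstract (text preparation, then TF-IDF vectorization), the sketch below processes a few sample strings in plain Python. The stop-word list, the sample texts, the `prepare` and `tfidf` helpers, and the `min_df` parameter are all assumptions made for illustration. The abstract does not state the paper's actual preparation steps or vectorizer settings, and the textbook weighting tf × ln(N/df) used here may differ from the variant the author used.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "a", "and", "in", "for", "is"}


def prepare(text, remove_stop_words=True):
    """Lowercase the text, keep runs of letters, optionally drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def tfidf(docs, min_df=1):
    """Return (vocabulary, vectors) with tf = count / len(doc), idf = ln(N / df).

    min_df is an assumed hyperparameter: terms that occur in fewer than
    min_df documents are left out of the vocabulary.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, c in df.items() if c >= min_df)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append(
            [counts[t] / total * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


# Hypothetical sample texts standing in for abstracts.
texts = [
    "Optimization of the control system",
    "Modeling and optimization of networks",
    "Classification of medical texts",
]
docs = [prepare(t) for t in texts]
vocab, vectors = tfidf(docs, min_df=1)
```

Raising `min_df` shrinks the vocabulary. Rerunning such a sketch with different settings and comparing the downstream classifier's quality is one simple way to observe the kind of hyperparameter influence the abstract refers to.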

Gusev Pavel Yrievich
Ph.D.

Voronezh State Technical University

Voronezh, Russia

Keywords: word processing, vectorization, software package, intelligent system, modeling

For citation: Gusev P.Y. Word processing and preparation of vectorization models for a software package for the classification of scientific texts. Modeling, Optimization and Information Technology. 2021;9(1). Available from: https://moitvivt.ru/ru/journal/pdf?id=912 DOI: 10.26102/2310-6018/2021.32.1.010 (In Russ).
