Keywords: outliers, machine learning, training sample, ensemble method, z-score, interquartile range method
Ensemble methods for detecting outliers in the preparation of a training data set
UDC 004.622
DOI: 10.26102/2310-6018/2022.38.3.013
Most machine learning methods perform best on data that follow a normal distribution. In practice, however, the training set often contains outliers of various kinds, which can significantly reduce the accuracy of machine learning methods. Thus, any machine learning task involves the problem of detecting outliers. The article classifies the main types of outliers and considers several methods for detecting one-dimensional outliers: the Grubbs test, the Z-score method, the robust Z-score (RZ-score) method, the interquartile range (IQR) method, and Winsorization. These methods are compared. For automated outlier detection, an ensemble method is proposed that combines several one-dimensional detection methods. The ensemble makes it possible to configure the automated detection procedure through a rule of the required strictness. The proposed method is applied to detect outliers in data on sales of goods during promotions in a large retail chain. The applicability of the ensemble of outlier detection methods to stratification of the training sample is shown. With this approach, the absolute and relative forecasting errors of the final model decreased by 5% compared with the initial model.
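The ensemble idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the thresholds (3.0 for the Z-score, 3.5 for the robust Z-score, 1.5 for the IQR multiplier), and the simple vote-counting rule `min_votes` are assumptions chosen for illustration, and the Grubbs test and Winsorization are left out for brevity.

```python
import numpy as np


def z_score_mask(x, threshold=3.0):
    """Flag points whose absolute Z-score exceeds the (assumed) threshold."""
    std = x.std()
    if std == 0:
        return np.zeros(x.shape, dtype=bool)
    return np.abs((x - x.mean()) / std) > threshold


def robust_z_score_mask(x, threshold=3.5):
    """Flag points using the median and MAD instead of the mean and std."""
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:
        return np.zeros(x.shape, dtype=bool)
    # 0.6745 rescales the MAD so it is comparable to a standard deviation
    # for normally distributed data.
    return np.abs(0.6745 * (x - median) / mad) > threshold


def iqr_mask(x, k=1.5):
    """Flag points lying more than k * IQR outside the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


def ensemble_outlier_mask(values, min_votes=2):
    """Flag a point when at least `min_votes` detectors agree (hypothetical rule).

    Raising `min_votes` makes the ensemble stricter about what it calls an outlier.
    """
    x = np.asarray(values, dtype=float)
    votes = (
        z_score_mask(x).astype(int)
        + robust_z_score_mask(x).astype(int)
        + iqr_mask(x).astype(int)
    )
    return votes >= min_votes


# Small sample: the plain Z-score cannot exceed 3 here, but the two robust detectors agree.
sample = [1, 2, 3, 4, 100]
lenient = ensemble_outlier_mask(sample, min_votes=2)
strict = ensemble_outlier_mask(sample, min_votes=3)
```

In this sketch, the value 100 is flagged when two agreeing detectors are enough, but not when all three must agree, because the plain Z-score is masked by the outlier in such a short series. That difference is one way to read the "rule of the required strictness" mentioned above.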
For citation: Dorofeev V.S., Volosatova T.M. Ensemble methods for detecting outliers in the preparation of a training data set. Modeling, Optimization and Information Technology. 2022;10(3). URL: https://moitvivt.ru/ru/journal/pdf?id=1210 DOI: 10.26102/2310-6018/2022.38.3.013 (In Russ.).
Received 11.07.2022
Revised 25.07.2022
Accepted 16.09.2022
Published 30.09.2022