Keywords: feature selection, binary classification problem, small data analysis, machine learning, assisted reproductive technologies
Comparison of the efficiency of different selecting features methods for solving the binary classification problem of predicting in vitro fertilization pregnancy
UDC 519.683, 519-7
DOI: 10.26102/2310-6018/2020.30.3.025
Determining the range of factors affecting the object of research is among the most important tasks of medical research. Its solution is complicated by a large amount of diverse data, including extensive anamnestic information and data from clinical studies, often combined with a limited number of observed patients. This work compares the results obtained by various feature selection methods in the search for a set of predictors on the basis of which a model with the best forecast quality can be built, for the binary classification problem of predicting the onset of pregnancy during in vitro fertilization (IVF). Data from the anamnesis of women, presented in binary form, were used as features. The sample consisted of 68 features and 689 objects. The features were examined for cross-correlation, after which methods and algorithms for selecting significant factors were applied: nonparametric criteria, interval estimation of proportions, the Z-criterion for the difference of two proportions, mutual information, the RFECV, ADD-DEL and Relief algorithms, algorithms based on permutation importance (Boruta, Permutation Importance, PIMP), and feature selection using model feature importances (lasso, random forest). To compare the quality of the selected feature sets, various classifiers were built, and their AUC metric and model complexity were calculated. All models have high prediction quality (AUC above 95%). The best of them are based on features selected using nonparametric criteria, model-based selection (lasso regression), and the Boruta, Permutation Importance, RFECV and ReliefF algorithms. The optimal set of predictors is a set of 30 binary features obtained by the Boruta algorithm, owing to the lower complexity of the model at a relatively high quality (model AUC 0.983).
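The abstract names the Z-criterion for the difference of two proportions among the statistical selection tools. A minimal sketch of that test (pooled-variance, two-sided) might look as follows; the counts are purely illustrative and are not taken from the study's data:

```python
# Hedged sketch: two-sided z-test for the difference of two proportions,
# as might be applied to a binary feature split by pregnancy outcome.
from math import sqrt
from scipy.stats import norm

def two_proportion_z(k1, n1, k2, n2):
    """H0: p1 == p2, using the pooled-variance z statistic."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)                 # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))             # z statistic, two-sided p-value

# Illustrative counts only: feature present in 45/120 positive vs 20/110
# negative outcomes.
z, p = two_proportion_z(45, 120, 20, 110)
print(f"z={z:.2f}, p={p:.4f}")
```

Features whose p-value falls below the chosen significance level would be retained as candidate predictors.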
Significant features include: history of pregnancy in general, ectopic and regressed pregnancies, spontaneous and term deliveries, abortions before 12 weeks; hypertension, ischemia, stroke, thrombosis, ulcers, obesity and diabetes mellitus in immediate family members; current hormonal treatment not associated with the IVF procedure; allergies; harmful occupational factors; normal duration and stability of the menstrual cycle without medication; hysteroscopy, laparoscopy and laparotomy; resection of any organ of the genitourinary system; whether this is the first IVF attempt; the presence of any surgical interventions or diseases of the genitourinary system; the patient's age and BMI; absence of chronic diseases; and the presence of diffuse fibrocystic mastopathy or hypothyroidism.
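The selection-and-comparison pipeline described in the abstract can be sketched with scikit-learn. This is a hypothetical illustration on synthetic binarized data, not the authors' code: the dataset shape mimics the 689-object, 68-feature sample, and the two selectors shown (RFECV and permutation importance with a top-20 cutoff) stand in for the fuller battery of methods compared in the paper.

```python
# Hypothetical sketch: select features two ways, then compare classifier AUC
# on each reduced feature set, as in the paper's method comparison.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 689-object, 68-binary-feature sample.
X, y = make_classification(n_samples=689, n_features=68, n_informative=20,
                           random_state=0)
X = (X > 0).astype(int)  # binarize to mimic anamnesis features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 1) RFECV: recursive feature elimination with cross-validated AUC.
rfecv = RFECV(LogisticRegression(max_iter=1000), step=2, cv=3,
              scoring="roc_auc")
rfecv.fit(X_tr, y_tr)

# 2) Permutation importance on a random forest; keep the top 20 features
#    (the cutoff is illustrative, not the paper's criterion).
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pi = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
mask = np.zeros(X.shape[1], dtype=bool)
mask[np.argsort(pi.importances_mean)[-20:]] = True

# Compare AUC of a classifier refitted on each selected subset.
for name, sel in [("RFECV", rfecv.support_), ("PermImp", mask)]:
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, sel], y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te[:, sel])[:, 1])
    print(f"{name}: {sel.sum()} features, AUC={auc:.3f}")
```

A real replication would add the remaining selectors (Boruta via BorutaPy, ReliefF, lasso-based selection, the nonparametric criteria) and weigh AUC against the number of retained features, as the paper does when preferring the 30-feature Boruta set.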
1. van Loendersloot L.L., van Wely M., Limpens J., Bossuyt P.M., Repping S., van der Veen F. Predictive factors in in vitro fertilization (IVF): a systematic review and meta-analysis. Hum Reprod Update. 2010;16(6):577–589. DOI: 10.1093/humupd/dmq015
2. Atasever M., Namlı Kalem M., Hatırnaz Ş., Hatırnaz E., Kalem Z., Kalaylıoğlu Z. Factors affecting clinical pregnancy rates after IUI for the treatment of unexplained infertility and mild male subfertility. J Turk Ger Gynecol Assoc. 2016;17:134–138. DOI: 10.5152/jtgga.2016.16056
3. Vaegter K.K., Lakic T.G., Olovsson M., Berglund L., Brodin T., Holte J. Which factors are most predictive for live birth after in vitro fertilization and intracytoplasmic sperm injection (IVF/ICSI) treatments? Analysis of 100 prospectively recorded variables in 8,400 IVF/ICSI single-embryo transfers. Fertil Steril. 2017;107(3):641–648.e2. DOI: 10.1016/j.fertnstert.2016.12.005
4. Vogiatzi P., Pouliakis A., Siristatidis C. An artificial neural network for the prediction of assisted reproduction outcome. J Assist Reprod Genet. 2019;36:1441–1448. DOI: 10.1007/s10815-019-01498-7
5. Ruey-Shiang Guh, Tsung-Chieh Jackson Wu, Shao-Ping Weng. Integrating genetic algorithm and decision tree learning for assistance in predicting in vitro fertilization outcomes. Expert Systems with Applications. 2011;38(4):4437–4449. DOI: 10.1016/j.eswa.2010.09.112
6. Hassan M.R., Al-Insaif S., Hossain M.I., Kamruzzaman J. A machine learning approach for prediction of pregnancy outcome following IVF treatment. Neural Comput & Applic. 2020;32:2283–2297. DOI: 10.1007/s00521-018-3693-9
7. Hafiz P., Nematollahi M., Boostani R., Namavar Jahromi B. Predicting Implantation Outcome of In Vitro Fertilization and Intracytoplasmic Sperm Injection Using Data Mining Techniques. Int J Fertil Steril. 2017;11(3):184–190. DOI: 10.22074/ijfs.2017.4882
8. Raef B., Ferdousi R. A Review of Machine Learning Approaches in Assisted Reproductive Technologies. Acta Inform Med. 2019;27(3):205–211. DOI: 10.5455/aim.2019.27.205-211
9. Guyon I., Elisseeff A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003;3:1157–1182.
10. Guyon I., Weston J., Barnhill S., Vapnik V. Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning. 2002;46:389–422. DOI: 10.1023/A:1012487302797
11. Saeys Y., Inza I., Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507–2517. DOI: 10.1093/bioinformatics/btm344
12. Voroncov K.V. Lectures on model evaluation and selection methods [Lekcii po metodam ocenivanija i vybora modelej]. Available at: http://www.ccas.ru/voron/download/Modeling.pdf (accessed 18.08.2020) (In Russ)
13. Altmann A., Toloşi L., Sander O., Lengauer T. Permutation importance: a corrected feature importance measure. Bioinformatics. 2010;26(10):1340–1347. DOI: 10.1093/bioinformatics/btq134
14. Kira K., Rendell L.A. The feature selection problem: traditional methods and a new algorithm. AAAI. 1992;129–134.
15. Kursa M.B., Rudnicki W.R. Feature Selection with the Boruta Package. Journal of Statistical Software. 2010;36(11):1–13. DOI: 10.18637/jss.v036.i11
16. Mazaheri V., Khodadadi H. Heart arrhythmia diagnosis based on the combination of morphological, frequency and nonlinear features of ECG signals and metaheuristic feature selection algorithm. Expert Systems with Applications. 2020;161:113697. DOI: 10.1016/j.eswa.2020.113697
17. Faris H., Mafarja M.M., Heidari A.A., Aljarah I., Al-Zoubi A.M., Mirjalili S., Fujita H. An efficient binary Salp Swarm Algorithm with crossover scheme for feature selection problems. Knowledge-Based Systems. 2018:154;43–67. DOI: 10.1016/j.knosys.2018.05.009
18. He H., Bai Y., Garcia E.A., Li S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). 2008;1322–1328. DOI: 10.1109/IJCNN.2008.4633969
19. Lemaître G., Nogueira F., Aridas C.K. Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. JMLR. 2017;18(17):1−5.
20. Glantz S. Primer of biostatistics. М.: Practica;1998. (In Russ)
21. Rothman K.J. A Show of Confidence. N Engl J Med. 1978;299(24):1362−1363. DOI: 10.1056/NEJM197812142992410
22. Das A.K., Kumar S., Jain S., Goswami S., Chakrabarti A., Chakraborty B. An information-theoretic graph-based approach for feature selection. Sādhanā. 2020;45:11. DOI: 10.1007/s12046-019-1238-2
23. Battiti R. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks. 1994;5(4):537−550. DOI: 10.1109/72.298224
24. Kononenko I. Estimating attributes: Analysis and extensions of RELIEF. Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence). 1994;784:171−182.
25. Robnik-Sikonja M., Kononenko I. An adaptation of Relief for attribute estimation in regression. ICML '97: Proceedings of the Fourteenth International Conference on Machine Learning. 1997;296–304.
26. Hamon J. Optimisation combinatoire pour la sélection de variables en régression en grande dimension: Application en génétique animale. Applications [stat.AP]. Université des Sciences et Technologie de Lille - Lille I, 2013. Français. fftel-00920205
27. Implementation of the RFECV algorithm in Scikit-learn. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html#sklearn.feature_selection.RFECV (accessed 18.08.2020)
28. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., Duchesnay É. Scikit-learn: Machine Learning in Python. JMLR. 2011;12(85):2825−2830.
29. Natekin A. Gradient boosting: capabilities, features and tricks beyond standard kaggle-style problems [Gradientnyj busting: vozmozhnosti, osobennosti i fishki za predelami standartnyh kaggle-style zadach]. Moscow Data Science Meetup. 2017. Available at: https://www.youtube.com/watch?time_continue=746&v=cM2c47Xlqk&feature=emb_logo (accessed 18.08.2020) (In Russ)
30. Shitikov V.K., Mastickij S.Je. Classification, regression, and Data Mining algorithms using R [Klassifikacija, regressija, algoritmy Data Mining s ispol'zovaniem R]. 2017. Available at: https://github.com/ranalytics/data-mining (accessed 18.08.2020) (In Russ)
31. ELI5 library. Available at: https://eli5.readthedocs.io/en/latest/index.html# (accessed 18.08.2020)
32. Anaconda - solutions for Data Science Practitioners and Enterprise Machine Learning. Available at: https://www.anaconda.com (accessed 18.08.2020)
33. SciPy library. Available at: https://www.scipy.org/index.html (accessed 18.08.2020)
34. ReliefF library. Available at: https://pypi.org/project/ReliefF/#description (accessed 18.08.2020)
35. LightGBM library. Available at: https://lightgbm.readthedocs.io/en/latest/index.html# (accessed 18.08.2020)
36. Grellier O. Feature Selection with Null Importances. Article on the Kaggle. Available at: https://www.kaggle.com/ogrellier/feature-selection-with-null-importances (accessed 18.08.2020)
37. Boruta implementation in Python. Available at: https://github.com/scikit-learn-contrib/boruta_py (accessed 18.08.2020)
38. NumPy library. Available at: https://numpy.org/ (accessed 18.08.2020)
39. Pandas library. Available at: https://pandas.pydata.org/ (accessed 18.08.2020)
40. Matplotlib library. Available at: https://matplotlib.org/index.html (accessed 18.08.2020)
41. Seaborn library. Available at: https://seaborn.pydata.org/# (accessed 18.08.2020)
42. Bergstra J., Yamins D., Cox D.D. Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. JMLR Workshop and Conference Proceedings. 2013;28(1):115–123.
43. Grjibovski А.М. Analysis of nominal data (independent observations). Human Ecology. 2008;6:58-68. (In Russ)
44. Ng A. Machine Learning Yearning. Available at: https://www.mlyearning.org/ (accessed 18.08.2020)
For citation: Sinotova S.L., Limanovskaya O.V., Plaksina A.N., Makutina V.A. Comparison of the efficiency of different selecting features methods for solving the binary classification problem of predicting in vitro fertilization pregnancy. Modeling, Optimization and Information Technology. 2020;8(3). URL: https://moit.vivt.ru/wp-content/uploads/2020/08/SinotovaSoavtors_3_20_1.pdf DOI: 10.26102/2310-6018/2020.30.3.025 (In Russ).
Published 30.09.2020