Feature selection methods for authorship attribution in cybersecurity context

Romanov A.S.

UDC 004.89
DOI: 10.26102/2310-6018/2024.44.1.001

Abstract

This paper considers methods for attributing the authorship of natural-language and artificially generated texts, a task that matters in cybersecurity and intellectual property protection as a means of preventing misinformation and fraud. The choice of attribution methods is justified by earlier findings on the effectiveness of fastText and the support vector machine (SVM). The feature selection algorithm is chosen by comparing five methods: a genetic algorithm, forward and backward sequential selection, regularization-based selection, and selection based on Shapley values. Together, these algorithms cover heuristic methods, elements of game theory, and iterative procedures. The regularization-based algorithm proved the most efficient, while methods based on complete brute-force search were inefficient for any set of authors. The accuracy of regularization-based selection with SVM averaged 77 %, outperforming the other methods by 3 to 10 % for an identical number of features. On the same tasks, the average accuracy of fastText was 84 %. The robustness of the developed approach to generated samples was also examined. SVM proved more robust to model confusion: the maximum loss of accuracy was 16 % for fastText and 12 % for SVM.
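Of the five selection methods compared, forward sequential selection is the simplest to illustrate. The following is a minimal sketch, not the implementation used in the study: it assumes a toy nearest-centroid classifier scored by leave-one-out accuracy in place of SVM or fastText, and the data, the feature meanings, and the function names (`centroid_accuracy`, `forward_selection`) are hypothetical.

```python
from statistics import mean


def centroid_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on the given features."""
    labels = sorted(set(y))  # fixed order keeps tie-breaking deterministic
    correct = 0
    for i in range(len(X)):
        # Build one centroid per author from every sample except the held-out one.
        centroids = {}
        for label in labels:
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            centroids[label] = [mean(r[f] for r in rows) for f in features]
        point = [X[i][f] for f in features]
        # Predict the author whose centroid is closest (squared Euclidean distance).
        predicted = min(
            labels,
            key=lambda lab: sum((p - c) ** 2 for p, c in zip(point, centroids[lab])),
        )
        correct += predicted == y[i]
    return correct / len(X)


def forward_selection(X, y, k):
    """Greedily add the feature that yields the highest accuracy until k are chosen."""
    selected = []
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: centroid_accuracy(X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: three numeric style features per text, two authors.
X = [
    [0.10, 5.0, 1.0],
    [0.12, 1.0, 1.2],
    [0.11, 3.0, 0.9],
    [0.30, 4.9, 1.1],
    [0.32, 1.1, 2.0],
    [0.31, 3.1, 2.1],
]
y = ["A", "A", "A", "B", "B", "B"]

if __name__ == "__main__":
    chosen = forward_selection(X, y, k=2)
    print("selected feature indices:", chosen)
    print("leave-one-out accuracy:", centroid_accuracy(X, y, chosen))
```

Backward sequential selection follows the same pattern in reverse: it starts from the full feature set and greedily removes the feature whose removal hurts accuracy least.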


Romanov Aleksandr Sergeevich
Candidate of Engineering Sciences, Associate Professor

Tomsk State University of Control Systems and Radioelectronics

Tomsk, Russian Federation

Keywords: feature selection, authorship attribution, machine learning, neural networks, text analysis, information security

Sources of funding: This research was supported by the Ministry of Science and Higher Education of the Russian Federation within the basic part of the state assignment of TUSUR for 2023–2025 (project No. FEWM-2023-0015).

For citation: Romanov A.S. Feature selection methods for authorship attribution in cybersecurity context. Modeling, Optimization and Information Technology. 2024;12(1). URL: https://moitvivt.ru/ru/journal/article?id=1489 DOI: 10.26102/2310-6018/2024.44.1.001 (In Russ.).

Received 06.12.2023

Revised 20.12.2023

Accepted 16.01.2024

Published 31.03.2024