Keywords: text authorship attribution, fastText, machine learning, text analysis, information security
Text authorship identification for open set of candidates in cybersecurity context
UDC 004.89
DOI: 10.26102/2310-6018/2024.44.1.012
The paper considers methods of authorship identification for fanfiction texts based on popular works of literature and cinema. The data for the study include texts from five popular topics of the Ficbook online library. The closed-set attribution task, in which the true author is known to be among the candidates, is the most common formulation. In practice, however, the true author of an anonymous text will not always be included in the candidate set. Author identification was therefore treated as a more complex version of the typical classification problem: attribution over an open set of authors. The proposed methods are based on machine learning (fastText and One-Class SVM with informative feature selection) and on statistical similarity measures over vector representations. The statistical methods proved to be the least effective even in the simpler cross-thematic case: compared with the method based on One-Class SVM, the difference in accuracy reaches 15%. For cross-thematic attribution, the average accuracy of the method combining One-Class SVM with feature selection and fastText was 85%, while for the more complex task of classification within a thematic group it ranged from 75% to 78%, depending on the group.
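The open-set idea can be illustrated with a minimal sketch of the simplest family mentioned above, a similarity measure over vector representations with a rejection threshold. Everything specific here is an assumption made for illustration and is not taken from the paper: the character trigram features, the threshold of 0.5, the toy texts, and the names `char_ngrams`, `cosine` and `attribute_open_set`. It does not reproduce the One-Class SVM or fastText methods.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute_open_set(text, profiles, threshold=0.5):
    """Return (author, score) for the most similar candidate profile.

    The author is None when no candidate reaches the threshold, i.e. the
    text is attributed to an unknown author outside the candidate set.
    The threshold value is an illustrative assumption.
    """
    query = char_ngrams(text)
    best_author, best_score = None, 0.0
    for author, profile in profiles.items():
        score = cosine(query, profile)
        if score > best_score:
            best_author, best_score = author, score
    if best_score < threshold:
        return None, best_score
    return best_author, best_score


# Toy candidate profiles (hypothetical data, not from the Ficbook corpus).
profiles = {
    "author_a": char_ngrams("the quick brown fox jumps over the lazy dog " * 3),
    "author_b": char_ngrams("colorless green ideas sleep furiously tonight " * 3),
}

known = attribute_open_set("the quick brown fox jumps over the lazy dog", profiles)
unknown = attribute_open_set("zzzz yyyy xxxx wwww", profiles)
```

In a closed-set setting the function would always return the best-scoring candidate; the threshold check is what lets it answer "none of the candidates" instead.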
For citation: Romanov A.S. Text authorship identification for open set of candidates in cybersecurity context. Modeling, Optimization and Information Technology. 2024;12(1). URL: https://moitvivt.ru/ru/journal/pdf?id=1510. DOI: 10.26102/2310-6018/2024.44.1.012. (In Russ.).
Received 24.01.2024
Revised 08.02.2024
Accepted 20.02.2024
Published 31.03.2024