Keywords: authorship, source code, commits, generation, neural network
Authorship identification of a heterogeneous source code for the purposes of cybersecurity management
UDC 004.89
DOI: 10.26102/2310-6018/2022.38.3.016
The article addresses the problem of identifying the author of a program from heterogeneous source code by means of a hybrid neural network. Solutions to this problem are especially relevant to information security, education, and copyright protection. The article analyzes modern methods for solving this problem. The authors propose a methodology based on a hybrid neural network proven in earlier studies and evaluate its effectiveness in both simple and difficult cases. The research includes experiments on previously unconsidered cases of source code author identification from heterogeneous data. Cases relevant to corporate development are examined, including the analysis of source code presented as commits and model training on datasets containing more than two programming languages. The increasingly relevant task of determining the authorship of artificially generated source code is also considered. A dataset was generated and a corresponding experiment was performed for each case. The effectiveness of the authors' methodology in all three difficult cases was evaluated using 10-fold cross-validation. The average accuracy on mixed datasets was 87% for two programming languages and 76% for three or more languages. The average accuracy of authorship identification for artificially generated source code was 81.5%. Authorship identification based on commits achieved an accuracy of 84%. The experiments showed that the effectiveness of the methodology can be improved in all three cases by using larger amounts of training data.
For citation: Romanov A.S., Kurtukova A.V., Shelupanov A.A., Fedotova A.M. Authorship identification of a heterogeneous source code for the purposes of cybersecurity management. Modeling, Optimization and Information Technology. 2022;10(3). URL: https://moitvivt.ru/ru/journal/pdf?id=1227 DOI: 10.26102/2310-6018/2022.38.3.016 (In Russ.).
Received 04.09.2022
Revised 18.09.2022
Accepted 24.09.2022
Published 30.09.2022