Recognizing function prologues in binary files with recurrent neural networks

Shaykhanov A.S.

UDC 004.056
DOI: 10.26102/2310-6018/2026.56.5.008

The article addresses the problem of recognising function starts in binary files, one of the key subtasks of software reverse engineering. The research is motivated by the limitations of traditional deterministic methods based on heuristics, signature analysis, and control flow graph analysis, and by the limited versatility of existing neural network solutions, which focus primarily on the x86 and x86-64 architectures. The aim of the work is to develop and experimentally evaluate a machine learning model that effectively recognises function starts in binary files compiled for other machine architectures, taking into account the practical specifics of reverse engineering tasks. The proposed approach uses a recurrent neural network (RNN) that processes the byte sequences of a binary file. A comparative analysis of existing neural network models for function recognition identified their advantages and limitations and justified the choice of a simple and reproducible RNN architecture. The study examines in detail how key model hyperparameters affect recognition quality: the length of the input sequence, the number of neurons in the recurrent layer, and the weights of the loss function. The experiments were performed on binary files for the ESP32 microcontroller (Xtensa architecture, little-endian) and the STM32WBA6 microcontroller (Cortex-M33 core, ARMv8-M architecture), using both standard and random alignment, which made it possible to evaluate the model's robustness to changes in the structure of binary data. The results show that the length of the input sequence is the most significant hyperparameter, while the influence of the loss function weights is secondary. The model was found not to generalise between different types of alignment, so a binary file must undergo preliminary analysis before the model is applied.
Based on the developed model, an extension for the IDA Pro disassembler has been implemented, demonstrating the practical applicability of the proposed approach in real reverse engineering tasks.
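
To make two of the hyperparameters named in the abstract concrete (the input sequence length and the loss function weights), the following is a minimal sketch, not the author's implementation. It assumes that the binary is cut into non-overlapping windows of `seq_len` bytes, that each byte carries a 0/1 label marking a function start, and that the loss is a class-weighted binary cross-entropy; none of these details is stated in the abstract. The names `make_windows`, `weighted_bce`, `w_pos`, and `w_neg` are hypothetical.

```python
import math


def make_windows(data: bytes, starts: set, seq_len: int):
    """Cut a binary into non-overlapping windows of seq_len bytes.

    Assumption: each byte gets label 1 if its offset is a known function
    start, otherwise 0. A trailing partial window is dropped.
    """
    windows = []
    for off in range(0, len(data) - seq_len + 1, seq_len):
        chunk = list(data[off:off + seq_len])
        labels = [1 if off + i in starts else 0 for i in range(seq_len)]
        windows.append((chunk, labels))
    return windows


def weighted_bce(probs, labels, w_pos: float, w_neg: float) -> float:
    """Class-weighted binary cross-entropy, averaged over one window.

    Assumption: w_pos scales the penalty for missed function starts and
    w_neg the penalty for false alarms on ordinary bytes.
    """
    eps = 1e-12  # keeps log() finite when a probability is exactly 0 or 1
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Toy data: 16 bytes with hypothetical function starts at offsets 0 and 8.
toy_windows = make_windows(bytes(range(16)), {0, 8}, seq_len=8)
# Loss of an uninformative predictor (p = 0.5 everywhere) on the first window.
toy_loss = weighted_bce([0.5] * 8, toy_windows[0][1], w_pos=10.0, w_neg=1.0)
```

Function starts are rare compared with ordinary bytes, which is why a sketch like this weights the positive class more heavily; in a real pipeline the per-byte probabilities would come from the recurrent network rather than a constant.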

Shaykhanov Artem Serikovich

Email: artem.shaykhanov@gmail.com

Bauman Moscow State Technical University

Moscow, Russian Federation

Keywords: reverse engineering, function recognition, binary file, function prologue, recurrent neural network, IDA Pro

For citation: Shaykhanov A.S. Recognizing function prologues in binary files with recurrent neural networks. Modeling, Optimization and Information Technology. 2026;14(5). URL: https://moitvivt.ru/ru/journal/article?id=2273 DOI: 10.26102/2310-6018/2026.56.5.008 (In Russ.).

Received 18.03.2026

Revised 06.05.2026

Accepted 13.05.2026

Published 31.05.2026