References

moitvivt

Моделирование, оптимизация и информационные технологии

Modeling, Optimization and Information Technology

2310-6018

Publisher

10.26102/2310-6018/2026.56.5.008

2273

Recognizing function prologues in binary files with recurrent neural networks

Шайханов

Артем Серикович

Shaykhanov

Artem Serikovich

artem.shaykhanov@gmail.com

Bauman Moscow State Technical University

01 01 2026

1 1

2026

This work is licensed under a Creative Commons Attribution 4.0 International License

This article addresses the recognition of function starts in binary files, a key subtask of software reverse engineering. The work is motivated by the limitations of traditional deterministic methods based on heuristics, signature analysis, and control-flow graph analysis, and by the limited generality of existing neural network solutions, which focus primarily on the x86 and x86-64 architectures. The aim is to develop and experimentally evaluate a machine learning model that effectively recognizes function starts in binary files compiled for other machine architectures, taking into account the practical needs of reverse engineering. The proposed baseline is a recurrent neural network (RNN) that processes the byte sequences of a binary file. A comparative analysis of existing neural network models for function recognition identifies their advantages and limitations and justifies the choice of a simple, reproducible RNN architecture. The study examines in detail how key hyperparameters (input sequence length, number of neurons in the recurrent layer, and loss function weights) affect recognition quality. Experiments were performed on binary files for the ESP32 microcontroller (Xtensa Little Endian architecture) and the STM32WBA6 microcontroller (Cortex-M33 core, ARMv8-M architecture), using both standard and random alignment, which made it possible to assess the model's robustness to changes in the structure of the binary data. The results show that input sequence length is the most significant hyperparameter, while the loss function weights have only a secondary effect. The model does not generalize across alignment types, so the binary file must be analyzed before the model is applied.
Based on the developed model, an extension for the IDA Pro disassembler was implemented, demonstrating the practical applicability of the proposed approach to real reverse engineering tasks.
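
The abstract names three hyperparameters: input sequence length, recurrent layer size, and loss function weights. The following minimal sketch shows where each one could enter a byte-level pipeline. It is an illustration under stated assumptions and not the author's implementation: the names `make_windows`, `TinyRNN`, and `weighted_bce` are hypothetical, the network is a plain Elman RNN with untrained random weights and a forward pass only, and the sample bytes and function-start offsets are made up.

```python
import math
import random


def make_windows(data, starts, seq_len):
    """Cut a binary into non-overlapping windows of seq_len bytes.

    Each window carries per-byte labels: 1 if a function starts at that
    offset, 0 otherwise. A trailing partial window is dropped.
    """
    windows = []
    for off in range(0, len(data) - seq_len + 1, seq_len):
        chunk = list(data[off:off + seq_len])
        labels = [1 if off + i in starts else 0 for i in range(seq_len)]
        windows.append((chunk, labels))
    return windows


class TinyRNN:
    """Elman RNN with a per-byte sigmoid output (assumed architecture)."""

    def __init__(self, hidden, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        # One row per byte value: equivalent to a one-hot input times W.
        self.w_in = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)]
                     for _ in range(256)]
        self.w_hh = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)]
                     for _ in range(hidden)]
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]

    def forward(self, seq):
        """Return one function-start probability per input byte."""
        h = [0.0] * self.hidden
        probs = []
        for b in seq:
            h = [math.tanh(self.w_in[b][j]
                           + sum(h[k] * self.w_hh[k][j]
                                 for k in range(self.hidden)))
                 for j in range(self.hidden)]
            z = sum(h[j] * self.w_out[j] for j in range(self.hidden))
            probs.append(1.0 / (1.0 + math.exp(-z)))
        return probs


def weighted_bce(probs, labels, w_pos, w_neg=1.0):
    """Class-weighted binary cross-entropy, averaged over the window."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        total -= (w_pos * y * math.log(p + eps)
                  + w_neg * (1 - y) * math.log(1.0 - p + eps))
    return total / len(labels)


# Made-up input: 64 bytes with function starts at offsets 0, 16 and 40.
data = bytes(range(64))
starts = {0, 16, 40}
windows = make_windows(data, starts, seq_len=16)

model = TinyRNN(hidden=8)
seq, labels = windows[0]
probs = model.forward(seq)
loss = weighted_bce(probs, labels, w_pos=10.0)
```

Function starts are rare compared with ordinary bytes, which is why a sketch like this weights the positive class more heavily (`w_pos` above); the abstract reports that in the actual experiments such weights mattered less than the window length `seq_len`.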

reverse engineering, function recognition, binary file, function prologue, recurrent neural network, IDA Pro

The study was performed without external funding.

References

Wartell R., Zhou Y., Hamlen K.W., Kantarcioglu M., Thuraisingham Bh. Differentiating code from data in x86 binaries. In: Machine Learning and Knowledge Discovery in Databases: Proceedings: Part III: European Conference (ECML PKDD 2011), 05–09 September 2011, Athens, Greece. Berlin, Heidelberg: Springer; 2011. P. 522–536. https://doi.org/10.1007/978-3-642-23808-6_34

Benkraouda H., Diwan N., Wang G. You Can't Judge a Binary by Its Header: Data-Code Separation for Non-Standard ARM Binaries Using Pseudo Labels. In: 2025 IEEE Symposium on Security and Privacy (SP), 12–15 May 2025, San Francisco, CA, USA. IEEE; 2025. P. 3727–3745. https://doi.org/10.1109/SP61157.2025.00036

Qin S., Yang F., Wang H., et al. Tady: A Neural Disassembler without Structural Constraint Violations. arXiv. URL: https://arxiv.org/pdf/2506.13323 [Accessed 28th January 2026].

David Y., Alon U., Yahav E. Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs. Proceedings of the ACM on Programming Languages. 2020;4. https://doi.org/10.1145/3428293

Jiang L., Jin X., Lin Zh. Beyond Classification: Inferring Function Names in Stripped Binaries via Domain Adapted LLMs. In: 32nd Annual Network and Distributed System Security Symposium (NDSS 2025), 24–28 February 2025, San Diego, California, USA. The Internet Society; 2025. https://doi.org/10.14722/ndss.2025.240797

Bao T., Burket J., Woo M., Turner R., Brumley D. BYTEWEIGHT: Learning to Recognize Functions in Binary Code. In: 23rd USENIX Security Symposium, 20–22 August 2014, San Diego, CA, USA. USENIX Association; 2014. P. 845–860.

He J., Li Sh., Wang X., et al. Neural-FEBI: Accurate Function Identification in Ethereum Virtual Machine Bytecode. Journal of Systems and Software. 2023;199. https://doi.org/10.1016/j.jss.2023.111627

Pei K., Guan J., Broughton M., et al. StateFormer: Fine-Grained Type Recovery from Binaries using Generative State Modeling. In: ESEC/FSE '21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 23–28 August 2021, Athens, Greece. New York: ACM; 2021. P. 690–702. https://doi.org/10.1145/3468264.3468607

Nitin V., Saieva A., Ray B., Kaiser G. DIRECT: A Transformer-based Model for Decompiled Identifier Renaming. In: Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021), 01–06 August 2021, Virtual Event. Association for Computational Linguistics; 2021. P. 48–57. https://doi.org/10.18653/v1/2021.nlp4prog-1.6

Wang H., Qu W., Katz G., et al. jTrans: Jump-Aware Transformer for Binary Code Similarity Detection. In: ISSTA '22: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 18–22 July 2022, Virtual Event. New York: ACM; 2022. P. 1–13. https://doi.org/10.1145/3533767.3534367

Yu Z., Cao R., Tang Q., et al. Order Matters: Semantic-Aware Neural Networks for Binary Code Similarity Detection. Proceedings of the AAAI Conference on Artificial Intelligence. 2020;34(01):1145–1152. https://doi.org/10.1609/aaai.v34i01.5466

Duan Y., Li X., Wang J., Yin H. DeepBinDiff: Learning Program-Wide Code Representations for Binary Diffing. In: 27th Annual Network and Distributed System Security Symposium (NDSS 2020), 23–26 February 2020, San Diego, California, USA. The Internet Society; 2020. https://doi.org/10.14722/ndss.2020.24311

Li X., Qu Y., Yin H. PalmTree: Learning an Assembly Language Model for Instruction Embedding. In: CCS '21: 2021 ACM SIGSAC Conference on Computer and Communications Security, 15–19 November 2021, Virtual Event. New York: ACM; 2021. P. 3236–3251. https://doi.org/10.1145/3460120.3484587

Gao Z., Wang H., Wang Y., Zhang Ch. Virtual Compiler Is All You Need For Assembly Code Search. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics: Volume 1: Long Papers, 11–16 August 2024, Bangkok, Thailand. Association for Computational Linguistics; 2024. P. 3040–3051. https://doi.org/10.18653/v1/2024.acl-long.167

Liu Ch., Saul R., Sun Y., et al. ASSEMBLAGE: Automatic Binary Dataset Construction for Machine Learning. In: Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024 (NeurIPS 2024), 10–15 December 2024, Vancouver, BC, Canada. 2024. https://openreview.net/pdf?id=dsK5EmmomU

Andriesse D., Slowinska A., Bos H. Compiler-agnostic function detection in binaries. In: 2017 IEEE European Symposium on Security and Privacy, 26–28 April 2017, Paris, France. IEEE; 2017. P. 177–189. https://doi.org/10.1109/EuroSP.2017.11

Flores-Montoya A., Schulte E.M. Datalog disassembly. In: 29th USENIX Security Symposium (USENIX Security 2020), 12–14 August 2020. USENIX Association; 2020. P. 1075–1092. https://www.usenix.org/system/files/sec20-flores-montoya.pdf

Shin E.Ch.R., Song D., Moazzezi R. Recognizing Functions in Binaries with Neural Networks. In: 24th USENIX Security Symposium (USENIX Security 15), 12–14 August 2015, Washington, D.C., USA. USENIX Association; 2015. P. 611–626.

Pei K., Guan J., Williams-King D., Yang J., Jana S. XDA: Accurate, robust disassembly with transfer learning. In: 28th Annual Network and Distributed System Security Symposium (NDSS 2021), 21–25 February 2021, Virtual Event. The Internet Society; 2021. https://doi.org/10.14722/ndss.2021.23112

Yu Sh., Qu Y., Hu X., Yin H. DeepDi: Learning a relational graph convolutional network model on instructions for fast and accurate disassembly. In: 31st USENIX Security Symposium (USENIX Security 2022), 10–12 August 2022, Boston, MA, USA. USENIX Association; 2022. P. 2709–2725.

Evans R., Hawkins W., Wang B. RustBound: Function Boundary Detection over Rust Stripped Binaries. In: Security and Privacy in Cyber-Physical Systems and Smart Vehicles: Second EAI International Conference (SmartSP 2024), 07–08 November 2024, New Orleans, LA, USA. Cham: Springer; 2025. P. 237–256. https://doi.org/10.1007/978-3-031-93354-7_11

Guo W., Mu D., Xu J., et al. LEMNA: Explaining Deep Learning based Security Applications. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS 2018), 15–19 October 2018, Toronto, ON, Canada. New York: ACM; 2018. P. 364–379. https://doi.org/10.1145/3243734.3243792

Springer R., Schmitz A., Leinweber A., Urban T., Dietrich Ch. Padding Matters – Exploring Function Detection in PE Files. arXiv. URL: https://arxiv.org/abs/2504.21520 [Accessed 9th February 2026].

Bundt J., Davinroy M., Agadakos I., Oprea A., Robertson W.K. Black-box Attacks Against Neural Binary Function Detection. In: Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2023), 16–18 October 2023, Hong Kong, China. New York: ACM; 2023. https://doi.org/10.1145/3607199.3607200

The author declares that there are no conflicts of interest.