<?xml version="1.0" encoding="UTF-8"?>
<article article-type="research-article" dtd-version="1.3" xml:lang="ru" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="https://metafora.rcsi.science/xsd_files/journal3.xsd">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">moitvivt</journal-id>
      <journal-title-group>
        <journal-title xml:lang="ru">Моделирование, оптимизация и информационные технологии</journal-title>
        <trans-title-group xml:lang="en">
          <trans-title>Modeling, Optimization and Information Technology</trans-title>
        </trans-title-group>
      </journal-title-group>
      <issn pub-type="epub">2310-6018</issn>
      <publisher>
        <publisher-name>Издательство</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.26102/2310-6018/2026.56.5.007</article-id>
      <article-id pub-id-type="custom" custom-type="elpub">2273</article-id>
      <title-group>
        <article-title xml:lang="ru">Распознавание начала функций в бинарных файлах с использованием рекуррентных нейронных сетей</article-title>
        <trans-title-group xml:lang="en">
          <trans-title>Recognizing function prologues in binary files with recurrent neural networks</trans-title>
        </trans-title-group>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name-alternatives>
            <name name-style="eastern" xml:lang="ru">
              <surname>Шайханов</surname>
              <given-names>Артем Серикович</given-names>
            </name>
            <name name-style="western" xml:lang="en">
              <surname>Shaykhanov</surname>
              <given-names>Artem Serikovich</given-names>
            </name>
          </name-alternatives>
          <email>artem.shaykhanov@gmail.com</email>
          <xref ref-type="aff">aff-1</xref>
        </contrib>
      </contrib-group>
      <aff-alternatives id="aff-1">
        <aff xml:lang="ru">Московский государственный технический университет им. Н.Э. Баумана</aff>
        <aff xml:lang="en">Bauman Moscow State Technical University</aff>
      </aff-alternatives>
      <pub-date pub-type="epub">
        <day>01</day>
        <month>01</month>
        <year>2026</year>
      </pub-date>
      <volume>1</volume>
      <issue>1</issue>
      <elocation-id>10.26102/2310-6018/2026.56.5.007</elocation-id>
      <permissions>
        <copyright-statement>Copyright © Авторы, 2026</copyright-statement>
        <copyright-year>2026</copyright-year>
        <license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/">
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International License</license-p>
        </license>
      </permissions>
      <self-uri xlink:href="https://moitvivt.ru/ru/journal/article?id=2273"/>
      <abstract xml:lang="ru">
        <p>В статье рассматривается задача распознавания начал функций в бинарных файлах, которая является одной из ключевых подзадач реверс-инжиниринга программного обеспечения. Актуальность исследования обусловлена ограничениями традиционных детерминированных методов, основанных на эвристиках, сигнатурном анализе и анализе графов потока управления, а также недостаточной универсальностью существующих нейросетевых решений, ориентированных преимущественно на архитектуры x86 и x86–64. Целью работы является разработка и экспериментальная оценка модели машинного обучения, способной эффективно распознавать начала функций в бинарных файлах, собранных под альтернативные машинные архитектуры, с учетом прикладной специфики задач обратной разработки. В качестве базового подхода предложено использование рекуррентной нейронной сети, обрабатывающей последовательности байтов бинарного файла. Проведен сравнительный анализ существующих нейросетевых моделей распознавания функций, выявлены их преимущества и ограничения, что позволило обосновать выбор простой и воспроизводимой архитектуры RNN. В рамках исследования детально изучено влияние ключевых гиперпараметров модели, включая длину входной последовательности, количество нейронов в рекуррентном слое и веса функции потерь, на качество распознавания. Эксперименты выполнены на бинарных файлах микроконтроллеров ESP32 архитектуры Xtensa Little Endian и STM32WBA6 с ядром Cortex-M33 архитектуры ARMv8-M с использованием как стандартного, так и случайного выравниваний, что позволило оценить устойчивость модели к изменению структуры бинарных данных. Результаты показывают, что длина входной последовательности является наиболее значимым гиперпараметром, в то время как влияние весов функции потерь носит вторичный характер. Установлено, что модель не обладает обобщаемостью между различными типами выравниваний, что требует предварительного анализа бинарного файла перед применением. На основе разработанной модели реализовано расширение для дизассемблера IDA Pro, демонстрирующее практическую применимость предложенного подхода в реальных задачах реверс-инжиниринга.</p>
      </abstract>
      <trans-abstract xml:lang="en">
        <p>The article discusses the problem of recognising function beginnings in binary files, which is one of the key subtasks of software reverse engineering. The relevance of the research is due to the limitations of traditional deterministic methods based on heuristics, signature analysis, and control flow graph analysis, as well as the insufficient versatility of existing neural network solutions, which are primarily focused on x86 and x86-64 architectures. The aim of the work is to develop and experimentally evaluate a machine learning model capable of effectively recognising function starts in binary files compiled for alternative machine architectures, taking into account the applied specifics of reverse engineering tasks. The basic approach proposed is to use a recurrent neural network that processes byte sequences of a binary file. A comparative analysis of existing neural network models for function recognition was conducted, and their advantages and limitations were identified, which made it possible to justify the choice of a simple and reproducible RNN architecture. The study examined in detail the impact of key model hyperparameters, including the length of the input sequence, the number of neurons in the recurrent layer, and the weights of the loss function, on the quality of recognition. The experiments were performed on binary files of the ESP32 microcontroller with Xtensa Little Endian architecture and STM32WBA6 microcontroller of Cortex-M33 core with ARMv8-M architecture using both standard and random alignment, which made it possible to evaluate the model's resistance to changes in the structure of binary data. The results show that the length of the input sequence is the most significant hyperparameter, while the influence of the loss function weights is secondary. It has been established that the model does not generalise between different types of alignments, which requires preliminary analysis of the binary file before application. Based on the developed model, an extension for the IDA Pro disassembler has been implemented, demonstrating the practical applicability of the proposed approach in real reverse engineering tasks.</p>
      </trans-abstract>
      <kwd-group xml:lang="ru">
        <kwd>реверс-инжиниринг</kwd>
        <kwd>распознавание функций</kwd>
        <kwd>бинарный файл</kwd>
        <kwd>начало функции</kwd>
        <kwd>рекуррентная нейронная сеть</kwd>
        <kwd>IDA Pro</kwd>
      </kwd-group>
      <kwd-group xml:lang="en">
        <kwd>reverse engineering</kwd>
        <kwd>function recognition</kwd>
        <kwd>binary file</kwd>
        <kwd>function prologue</kwd>
        <kwd>recurrent neural network</kwd>
        <kwd>IDA Pro</kwd>
      </kwd-group>
      <funding-group>
        <funding-statement xml:lang="ru">Исследование выполнено без спонсорской поддержки.</funding-statement>
        <funding-statement xml:lang="en">The study was performed without external funding.</funding-statement>
      </funding-group>
    </article-meta>
  </front>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="cit1">
        <label>1</label>
        <mixed-citation xml:lang="ru">Wartell R., Zhou Y., Hamlen K.W., Kantarcioglu M., Thuraisingham Bh. Differentiating code from data in x86 binaries. In: Machine Learning and Knowledge Discovery in Databases: Proceedings: Part III: European Conference (ECML PKDD 2010), 05–09 September 2011, Athens, Greece. Berlin, Heidelberg: Springer; 2011. P. 522–536. https://doi.org/10.1007/978-3-642-23808-6_34</mixed-citation>
      </ref>
      <ref id="cit2">
        <label>2</label>
        <mixed-citation xml:lang="ru">Benkraouda H., Diwan N., Wang G. You Can't Judge a Binary by Its Header: Data-Code Separation for Non-Standard ARM Binaries Using Pseudo Labels. In: 2025 IEEE Symposium on Security and Privacy (SP), 12–15 May 2025, San Francisco, CA, USA. IEEE; 2025. P. 3727–3745. https://doi.org/10.1109/SP61157.2025.00036</mixed-citation>
      </ref>
      <ref id="cit3">
        <label>3</label>
        <mixed-citation xml:lang="ru">Qin S., Yang F., Wang H., et al. Tady: A Neural Disassembler without Structural Constraint Violations. arXiv. URL: https://arxiv.org/pdf/2506.13323 [Accessed 28th January 2026].</mixed-citation>
      </ref>
      <ref id="cit4">
        <label>4</label>
        <mixed-citation xml:lang="ru">David Y., Alon U., Yahav E. Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs. Proceedings of the ACM on Programming Languages. 2020;4. https://doi.org/10.1145/3428293</mixed-citation>
      </ref>
      <ref id="cit5">
        <label>5</label>
        <mixed-citation xml:lang="ru">Jiang L., Jin X., Lin Zh. Beyond Classification: Inferring Function Names in Stripped Binaries via Domain Adapted LLMs. In: 32nd Annual Network and Distributed System Security Symposium (NDSS 2025), 24–28 February 2025, San Diego, California, USA. The Internet Society; 2025. https://doi.org/10.14722/ndss.2025.240797</mixed-citation>
      </ref>
      <ref id="cit6">
        <label>6</label>
        <mixed-citation xml:lang="ru">Bao T., Burket J., Woo M., Turner R., Brumley D. BYTEWEIGHT: Learning to Recognize Functions in Binary Code. In: 23rd USENIX Security Symposium, 20–22 August 2014, San Diego, CA, USA. USENIX Association; 2014. P. 845–860.</mixed-citation>
      </ref>
      <ref id="cit7">
        <label>7</label>
        <mixed-citation xml:lang="ru">He J., Li Sh., Wang X., et al. Neural-FEBI: Accurate Function Identification in Ethereum Virtual Machine Bytecode. Journal of Systems and Software. 2023;199. https://doi.org/10.1016/j.jss.2023.111627</mixed-citation>
      </ref>
      <ref id="cit8">
        <label>8</label>
        <mixed-citation xml:lang="ru">Pei K., Guan J., Broughton M., et al. StateFormer: Fine-Grained Type Recovery from Binaries using Generative State Modeling. In: ESEC/FSE '21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 23–28 August 2021, Athens, Greece. New York: ACM; 2021. P. 690–702. https://doi.org/10.1145/3468264.3468607</mixed-citation>
      </ref>
      <ref id="cit9">
        <label>9</label>
        <mixed-citation xml:lang="ru">Nitin V., Saieva A., Ray B., Kaiser G. DIRECT: A Transformer-based Model for Decompiled Identifier Renaming. In: Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021), 01–06 August 2021, Virtual Event. Association for Computational Linguistics; 2021. P. 48–57. https://doi.org/10.18653/v1/2021.nlp4prog-1.6</mixed-citation>
      </ref>
      <ref id="cit10">
        <label>10</label>
        <mixed-citation xml:lang="ru">Wang H., Qu W., Katz G., et al. jTrans: Jump-Aware Transformer for Binary Code Similarity Detection. In: ISSTA '22: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 18–22 July 2022, Virtual Event. New York: ACM; 2022. P. 1–13. https://doi.org/10.1145/3533767.3534367</mixed-citation>
      </ref>
      <ref id="cit11">
        <label>11</label>
        <mixed-citation xml:lang="ru">Yu Z., Cao R., Tang Q., et al. Order Matters: Semantic-Aware Neural Networks for Binary Code Similarity Detection. Proceedings of the AAAI Conference on Artificial Intelligence. 2020;34(01):1145–1152. https://doi.org/10.1609/aaai.v34i01.5466</mixed-citation>
      </ref>
      <ref id="cit12">
        <label>12</label>
        <mixed-citation xml:lang="ru">Duan Y., Li X., Wang J., Yin H. DeepBinDiff: Learning Program-Wide Code Representations for Binary Diffing. In: 27th Annual Network and Distributed System Security Symposium (NDSS 2020), 23–26 February 2020, San Diego, California, USA. The Internet Society; 2020. https://doi.org/10.14722/ndss.2020.24311</mixed-citation>
      </ref>
      <ref id="cit13">
        <label>13</label>
        <mixed-citation xml:lang="ru">Li X., Qu Y., Yin H. PalmTree: Learning an Assembly Language Model for Instruction Embedding. In: CCS '21: 2021 ACM SIGSAC Conference on Computer and Communications Security, 15–19 November 2021, Virtual Event. New York: ACM; 2021. P. 3236–3251. https://doi.org/10.1145/3460120.3484587</mixed-citation>
      </ref>
      <ref id="cit14">
        <label>14</label>
        <mixed-citation xml:lang="ru">Gao Z., Wang H., Wang Y., Zhang Ch. Virtual Compiler Is All You Need For Assembly Code Search. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics: Volume 1: Long Papers, 11–16 August 2024, Bangkok, Thailand. Association for Computational Linguistics; 2024. P. 3040–3051. https://doi.org/10.18653/v1/2024.acl-long.167</mixed-citation>
      </ref>
      <ref id="cit15">
        <label>15</label>
        <mixed-citation xml:lang="ru">Liu Ch., Saul R., Sun Y., et al. ASSEMBLAGE: Automatic Binary Dataset Construction for Machine Learning. In: Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024 (NeurIPS 2024), 10–15 December 2024, Vancouver, BC, Canada. 2024. https://openreview.net/pdf?id=dsK5EmmomU</mixed-citation>
      </ref>
      <ref id="cit16">
        <label>16</label>
        <mixed-citation xml:lang="ru">Andriesse D., Slowinska A., Bos H. Compiler-agnostic function detection in binaries. In: 2017 IEEE European Symposium on Security and Privacy, 26–28 April 2017, Paris, France. IEEE; 2017. P. 177–189. https://doi.org/10.1109/EuroSP.2017.11</mixed-citation>
      </ref>
      <ref id="cit17">
        <label>17</label>
        <mixed-citation xml:lang="ru">Flores-Montoya A., Schulte E.M. Datalog disassembly. In: 29th USENIX Security Symposium (USENIX Security 2020), 12–14 August 2020. USENIX Association; 2020. P. 1075–1092. https://www.usenix.org/system/files/sec20-flores-montoya.pdf</mixed-citation>
      </ref>
      <ref id="cit18">
        <label>18</label>
        <mixed-citation xml:lang="ru">Shin E.Ch.R., Song D., Moazzezi R. Recognizing Functions in Binaries with Neural Networks. In: 24th USENIX Security Symposium (USENIX Security 15), 12–14 August 2015, Washington, D.C., USA. USENIX Association; 2015. P. 611–626.</mixed-citation>
      </ref>
      <ref id="cit19">
        <label>19</label>
        <mixed-citation xml:lang="ru">Pei K., Guan J., Williams-King D., Yang J., Jana S. XDA: Accurate, robust disassembly with transfer learning. In: 28th Annual Network and Distributed System Security Symposium (NDSS 2021), 21–25 February 2021, Virtual Event. The Internet Society; 2021. https://doi.org/10.14722/ndss.2021.23112</mixed-citation>
      </ref>
      <ref id="cit20">
        <label>20</label>
        <mixed-citation xml:lang="ru">Yu Sh., Qu Y., Hu X., Yin H. DeepDi: Learning a relational graph convolutional network model on instructions for fast and accurate disassembly. In: 31st USENIX Security Symposium (USENIX Security 2022), 10–12 August 2022, Boston, MA, USA. USENIX Association; 2022. P. 2709–2725.</mixed-citation>
      </ref>
      <ref id="cit21">
        <label>21</label>
        <mixed-citation xml:lang="ru">Evans R., Hawkins W., Wang B. RustBound: Function Boundary Detection over Rust Stripped Binaries. In: Security and Privacy in Cyber-Physical Systems and Smart Vehicles: Second EAI International Conference (SmartSP 2024), 07–08 November 2024, New Orleans, LA, USA. Cham: Springer; 2025. P. 237–256. https://doi.org/10.1007/978-3-031-93354-7_11</mixed-citation>
      </ref>
      <ref id="cit22">
        <label>22</label>
        <mixed-citation xml:lang="ru">Guo W., Mu D., Xu J., et al. LEMNA: Explaining Deep Learning based Security Applications. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS 2018), 15–19 October 2018, Toronto, ON, Canada. New York: ACM; 2018. P. 364–379. https://doi.org/10.1145/3243734.3243792</mixed-citation>
      </ref>
      <ref id="cit23">
        <label>23</label>
        <mixed-citation xml:lang="ru">Springer R., Schmitz A., Leinweber A., Urban T., Dietrich Ch. Padding Matters – Exploring Function Detection in PE Files. arXiv. URL: https://arxiv.org/abs/2504.21520 [Accessed 9th February 2026].</mixed-citation>
      </ref>
      <ref id="cit24">
        <label>24</label>
        <mixed-citation xml:lang="ru">Bundt J., Davinroy M., Agadakos I., Oprea A., Robertson W.K. Black-box Attacks Against Neural Binary Function Detection. In: Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2023), 16–18 October 2023, Hong Kong, China. New York: ACM; 2023. https://doi.org/10.1145/3607199.3607200</mixed-citation>
      </ref>
    </ref-list>
    <fn-group>
      <fn fn-type="conflict">
        <p>The authors declare that there are no conflicts of interest present.</p>
      </fn>
    </fn-group>
  </back>
</article>