
Assessing the quality of the result in the problem of source code generation from an image

Nikitin I.V. 

UDC 004.832.22
DOI: 10.26102/2310-6018/2025.48.1.030

Abstract

This study assesses the feasibility of building a system that executes functional tests for the task of generating source code from an image. Many metrics exist for assessing the quality of text predicted by a neural network: purely mathematical ones, such as BLEU and ROUGE, and ones that use another model for evaluation, such as BERTScore and BLEURT. The difficulty with generated program code, however, is that code is a set of instructions for performing a specific task, not free-form text. The relevance of this work stems from the fact that publications related to the pix2code system make no mention of an automated test environment that can check whether the generated code meets the specified conditions. In the course of the work, a subsystem was implemented that automatically reports the differences between an image rendered from the predicted code and an image rendered from the reference code. The results of this subsystem are then compared with the BLEU metric. The experiment shows no obvious relationship between the BLEU value and test outcomes, which means that tests are needed as an additional check of a model's fitness for use.
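The page does not reproduce the subsystem's source code. As a rough illustration only, the following Python sketch contrasts the two kinds of checks compared in the study: a BLEU score computed over pix2code-style DSL tokens, and a functional check that passes only when the screenshot rendered from the predicted code matches the screenshot rendered from the reference code pixel for pixel. The rendering pipeline is assumed to exist externally; the file names and the sample DSL lines below are hypothetical, not taken from the article.

```python
# Minimal sketch, NOT the author's actual subsystem. It assumes some external
# pipeline has already rendered both code samples to PNG screenshots.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from PIL import Image, ImageChops


def code_bleu(reference_code: str, predicted_code: str) -> float:
    """BLEU over whitespace-separated DSL tokens; smoothing keeps short
    token sequences from collapsing to a zero score."""
    reference_tokens = [reference_code.split()]  # BLEU allows several references
    hypothesis_tokens = predicted_code.split()
    return sentence_bleu(reference_tokens, hypothesis_tokens,
                         smoothing_function=SmoothingFunction().method1)


def screenshots_match(reference_png: str, predicted_png: str) -> bool:
    """Functional test: pass only if the screenshot rendered from the
    predicted code is pixel-identical to the reference screenshot."""
    ref = Image.open(reference_png).convert("RGB")
    pred = Image.open(predicted_png).convert("RGB")
    if ref.size != pred.size:
        return False
    # getbbox() returns None when the difference image is entirely black,
    # i.e. when the two screenshots coincide pixel for pixel.
    return ImageChops.difference(ref, pred).getbbox() is None


# A single swapped token (btn-green vs btn-red) barely moves BLEU, yet the
# rendered pages would differ, so the functional test would fail: this is the
# kind of disagreement between the two measures that the study reports.
reference = "header { btn-active btn-inactive } row { single { btn-green } }"
predicted = "header { btn-active btn-inactive } row { single { btn-red } }"
print(f"BLEU = {code_bleu(reference, predicted):.3f}")
```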

List of references

1. Nikitin I.V. Influence of the TensorFlow library’s version on the quality of code generation from an image. Modeling, Optimization and Information Technology. 2024;12(4). (In Russ.). https://doi.org/10.26102/2310-6018/2024.47.4.040

2. Zou D., Wu G. Automatic Code Generation for Android Applications Based on Improved Pix2code. Journal of Artificial Intelligence and Technology. 2024;4(4):325–331. https://doi.org/10.37965/jait.2024.0515

3. Beltramelli T. pix2code: Generating Code from a Graphical User Interface Screenshot. In: EICS '18: Proceedings of the ACM SIGCHI Symposium on Engineering Interactive Computing Systems, 19–22 June 2018, Paris, France. New York: Association for Computing Machinery; 2018. https://doi.org/10.1145/3220134.3220135

4. Zhu Zh., Xue Zh., Yuan Z. Automatic Graphics Program Generation Using Attention-Based Hierarchical Decoder. In: Computer Vision – ACCV 2018: 14th Asian Conference on Computer Vision: Revised Selected Papers: Part VI, 02–06 December 2018, Perth, Australia. Cham: Springer; 2019. pp. 181–196. https://doi.org/10.1007/978-3-030-20876-9_12

5. Papineni K., Roukos S., Ward T., Zhu W.-J. BLEU: a Method for Automatic Evaluation of Machine Translation. In: ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 07–12 July 2002, Philadelphia, USA. Stroudsburg: Association for Computational Linguistics; 2002. pp. 311–318. https://doi.org/10.3115/1073083.1073135

6. Doddington G. Automatic Evaluation of Machine Translation Quality Using N-gram Co-occurrence Statistics. In: HLT '02: Proceedings of the Second International Conference on Human Language Technology Research, 24–27 March 2002, San Diego, USA. San Francisco: Morgan Kaufmann Publishers Inc.; 2002. pp. 138–145. https://doi.org/10.3115/1289189.1289273

7. Lin Ch.-Ye. ROUGE: A Package for Automatic Evaluation of Summaries. In: Proceedings of the Workshop on Text Summarization Branches Out, 25–26 July 2004, Barcelona, Spain. Association for Computational Linguistics; 2004. pp. 74–81.

8. Popović M. chrF++: words helping character n-grams. In: Proceedings of the Second Conference on Machine Translation, 07–08 September 2017, Copenhagen, Denmark. Association for Computational Linguistics; 2017. pp. 612–618. https://doi.org/10.18653/v1/W17-4770

9. Hendrycks D., Basart S., Kadavath S., et al. Measuring Coding Challenge Competence With APPS. In: 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, 06–14 December 2021, Online. https://doi.org/10.48550/arXiv.2105.09938

10. Zhang T., Kishore V., Wu F., Weinberger K.Q., Artzi Yo. BERTScore: Evaluating Text Generation with BERT. In: 8th International Conference on Learning Representations, ICLR 2020, 26–30 April 2020, Addis Ababa, Ethiopia. 2020. https://doi.org/10.48550/arXiv.1904.09675

11. Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 02–07 June 2019, Minneapolis, USA. Association for Computational Linguistics; 2019. pp. 4171–4186. https://doi.org/10.18653/v1/N19-1423

12. Rei R., Stewart C., Farinha A.C., Lavie A. COMET: A Neural Framework for MT Evaluation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 16–20 November 2020, Online. Association for Computational Linguistics; 2020. pp. 2685–2702. https://doi.org/10.18653/v1/2020.emnlp-main.213

13. Tran N., Tran H., Nguyen S., Nguyen H., Nguyen T. Does BLEU Score Work for Code Migration? In: 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), 25–26 May 2019, Montreal, USA. IEEE; 2019. pp. 165–176. https://doi.org/10.1109/ICPC.2019.00034

14. Ren Sh., Guo D., Lu Sh., et al. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv. URL: https://doi.org/10.48550/arXiv.2009.10297 [Accessed 19th February 2025].

15. Evtikhiev M., Bogomolov E., Sokolov Ya., Bryksin T. Out of the BLEU: How Should We Assess Quality of the Code Generation Models? Journal of Systems and Software. 2023;203. https://doi.org/10.1016/j.jss.2023.111741

About authors

Nikitin Ilya Vladimirovich

Plekhanov Russian University of Economics

Moscow, Russian Federation

Keywords: code generation, image, machine learning, BLEU, functional tests

For citation: Nikitin I.V. Assessing the quality of the result in the problem of source code generation from an image. Modeling, Optimization and Information Technology. 2025;13(1). (In Russ.). URL: https://moitvivt.ru/ru/journal/pdf?id=1830 DOI: 10.26102/2310-6018/2025.48.1.030



Received 20.02.2025

Revised 04.03.2025

Accepted 11.03.2025