Developing a computer vision model for region detection in visually rich documents

Nikitin P.V., Gorokhova R.I.

UDC 004.9
DOI: 10.26102/2310-6018/2025.49.2.010

Abstract

Efficient automation of visually rich document processing is an important problem in computer vision research. This paper develops a computer vision model for region detection in visually rich documents, with an emphasis on receipt processing using reinforcement learning. Given the growing volume of paper documentation and the need to automate data processing, efficient identification of key receipt elements (such as amounts, dates, and product names) is becoming especially relevant. The paper presents a model architecture based on convolutional neural networks (CNNs), trained on a variety of datasets that include receipt images of different formats and quality. It considers information extraction methods and a reinforcement learning algorithm that uses a trimmed loss function and the reinforcement learning loop presented in SpanIE-Recur. The data preprocessing stages are described, including sample augmentation and image normalization, which help increase detection accuracy. The experimental results show that the proposed model is highly effective, achieving high precision and recall in identifying regions of interest. Possible applications of the technology in accounting automation, financial analysis, and electronic document management are also discussed. The paper concludes by emphasizing the importance of further research on improving image processing algorithms and extending the model to other types of documents.
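As an illustration of how detection quality for regions of interest is commonly scored, the following minimal sketch computes precision and recall by matching predicted boxes to ground-truth boxes using an intersection-over-union (IoU) threshold. The box format, the 0.5 threshold, the greedy matching strategy, and all function names are assumptions made for illustration; the evaluation protocol actually used in the paper is not specified in this abstract.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = overlap_w * overlap_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    predicted: List[Box], truth: List[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedily match each predicted box to at most one unmatched ground-truth box."""
    unmatched = list(truth)
    true_positives = 0
    for p in predicted:
        # Pick the remaining ground-truth box that overlaps this prediction most.
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

For example, with three predicted regions of which two overlap the two annotated regions well enough, this sketch would report a precision of 2/3 and a recall of 1. Greedy matching is the simplest choice here; stricter protocols sort predictions by confidence score before matching.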


Nikitin Petr Vladimirovich
Candidate of Pedagogical Sciences, Docent
Email: pvnikitin@fa.ru


Financial University under the Government of the Russian Federation

Moscow, Russian Federation

Gorokhova Rimma Ivanovna
Candidate of Pedagogical Sciences, Docent
Email: rigorokhova@fa.ru


Financial University under the Government of the Russian Federation

Moscow, Russian Federation

Keywords: visually rich document, computer vision, reinforcement learning, object detection, receipt processing, automation, document areas, data preprocessing, electronic document management

Sources of funding: The work was prepared based on the results of research carried out at the expense of budgetary funds under a state assignment from the Financial University.

For citation: Nikitin P.V., Gorokhova R.I. Developing a computer vision model for region detection in visually rich documents. Modeling, Optimization and Information Technology. 2025;13(2). URL: https://moitvivt.ru/ru/journal/article?id=1858 DOI: 10.26102/2310-6018/2025.49.2.010 (In Russ.).


Received 20.03.2025

Revised 14.04.2025

Accepted 21.04.2025

Published 30.06.2025