Human pose estimation from video stream

Potenko M.A.

UDC 004.8
DOI: 10.26102/2310-6018/2025.49.2.036

Abstract

The article presents a human body pose estimation system built on two neural networks. The system determines the spatial location of 33 key points corresponding to the main joints of the human body (wrists, elbows, shoulders, feet, etc.) and constructs a segmentation mask that delineates the boundaries of the human figure in an image. The first network performs object detection and is based on the Single Shot Detector (SSD) architecture combined with Feature Pyramid Network (FPN) principles. This approach combines features from different levels of abstraction and processes input images at a resolution of 224×224 pixels to locate people in a frame. A distinctive feature of the implementation is the use of information from previous frames, which helps reduce computational cost. The second network detects key points and constructs the segmentation mask. It also relies on multi-scale feature analysis using FPN, which supports accurate localization of key points and object boundaries. This network operates on images at a resolution of 256×256 pixels, which provides the precision needed to determine spatial coordinates. The architecture is modular and scalable, so the system can be adapted to tasks that require different numbers of key points. The results are applicable in computer vision, animation, cartoon production, security systems, and other areas related to the analysis and processing of visual information.
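The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not say how information from previous frames is used, so the sketch assumes one common strategy, in which the region of interest for the next frame is derived from the key points of the current frame and the detector runs only when tracking is lost. The names `PoseTracker`, `box_from_keypoints`, `fake_detect`, `fake_estimate`, and the 0.5 confidence threshold are all hypothetical, and the two networks are replaced by stub functions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]   # x_min, y_min, x_max, y_max, normalized to [0, 1]
Point = Tuple[float, float, float]        # x, y, confidence

NUM_KEYPOINTS = 33            # number of key points stated in the abstract
DETECTOR_INPUT = (224, 224)   # detector input resolution stated in the abstract
LANDMARK_INPUT = (256, 256)   # key point network input resolution stated in the abstract


def box_from_keypoints(points: Sequence[Point], margin: float = 0.1) -> Box:
    """Bounding box around the key points, enlarged by a relative margin (assumed value)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    return (max(0.0, min(xs) - margin * w), max(0.0, min(ys) - margin * h),
            min(1.0, max(xs) + margin * w), min(1.0, max(ys) + margin * h))


class PoseTracker:
    """Runs the detector only when no region of interest survives from the previous frame."""

    def __init__(self,
                 detect: Callable[[object], Optional[Box]],
                 estimate: Callable[[object, Box], List[Point]],
                 min_confidence: float = 0.5) -> None:
        self.detect = detect            # stands in for the SSD/FPN detector (224x224 input)
        self.estimate = estimate        # stands in for the key point network (256x256 input)
        self.min_confidence = min_confidence  # hypothetical threshold for "tracking lost"
        self.roi: Optional[Box] = None
        self.detector_calls = 0

    def process(self, frame: object) -> Optional[List[Point]]:
        if self.roi is None:
            # No usable information from the previous frame: run the detector.
            self.detector_calls += 1
            self.roi = self.detect(frame)
            if self.roi is None:
                return None
        points = self.estimate(frame, self.roi)
        mean_conf = sum(p[2] for p in points) / len(points)
        if mean_conf < self.min_confidence:
            # Tracking lost: forget the region so the next frame triggers detection.
            self.roi = None
            return None
        # Reuse the current key points to place the region for the next frame.
        self.roi = box_from_keypoints(points)
        return points


def fake_detect(frame: object) -> Optional[Box]:
    """Stub detector: always reports one person in the middle of the frame."""
    return (0.25, 0.25, 0.75, 0.75)


def fake_estimate(frame: object, roi: Box) -> List[Point]:
    """Stub key point network: spreads 33 points along the diagonal of the region."""
    x0, y0, x1, y1 = roi
    steps = NUM_KEYPOINTS - 1
    return [(x0 + (x1 - x0) * i / steps, y0 + (y1 - y0) * i / steps, 0.9)
            for i in range(NUM_KEYPOINTS)]


if __name__ == "__main__":
    tracker = PoseTracker(fake_detect, fake_estimate)
    for _ in range(5):
        tracker.process(frame=None)
    print("detector calls over 5 frames:", tracker.detector_calls)
```

With the stubs above, the detector is invoked on the first frame only; the remaining frames reuse the region derived from the previous key points. In a real system the stubs would be replaced by the two trained networks, each resizing its input to the stated resolution.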

Potenko Maxim Alexeevich

Moscow Aviation Institute (National Research University)

Moscow, Russian Federation

Keywords: neural networks, convolutional neural networks, machine learning, computer vision, human pose estimation, keypoints, image segmentation

For citation: Potenko M.A. Human pose estimation from video stream. Modeling, Optimization and Information Technology. 2025;13(2). URL: https://moitvivt.ru/ru/journal/pdf?id=1920 DOI: 10.26102/2310-6018/2025.49.2.036 (In Russ.).

Received 22.04.2025

Revised 20.05.2025

Accepted 04.06.2025

Published 30.06.2025