Study of deep learning models in the task of recognizing technological operations as a sequence of hand movements
UDC 004.896
DOI: 10.26102/2310-6018/2024.45.2.035
This paper considers methods for recognizing in video a specific class of manual-labor technological operations that consist of a sequence of hand and finger movements. A technological operation is treated here as a sequence of new, task-specific sign-language symbols. The paper reviews various methods of gesture recognition in video and investigates a two-stage approach. At the first stage, hand keypoints are detected in each frame using the open-source MediaPipe library. At the second stage, the frame-by-frame sequence of keypoints is transformed into text by a trained neural network with the Transformer architecture. The main focus is on training a Transformer model on the open American Sign Language (ASL) dataset to recognize sign-language sentences in video. The paper then examines the applicability of this approach and of the trained ASL model to recognizing manual-labor technological operations involving fine motor skills as a text sequence. The results can be useful for studying labor processes with fast movements and short time intervals, and for algorithms that recognize manual-labor technological operations in video data.
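To make the two-stage pipeline concrete, below is a minimal Python sketch of the first stage, assuming the MediaPipe Hands solution API (21 landmarks per detected hand, each with x, y, z coordinates). The function name extract_keypoints, the zero-padding scheme, and the confidence threshold are illustrative assumptions, not the authors' implementation.

```python
import cv2
import mediapipe as mp

def extract_keypoints(video_path):
    """Stage 1 (sketch): detect hand landmarks in every frame with MediaPipe Hands."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp.solutions.hands.Hands(static_image_mode=False,
                                  max_num_hands=2,
                                  min_detection_confidence=0.5) as hands:
        while True:
            ok, bgr = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
            result = hands.process(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))
            vec = []
            if result.multi_hand_landmarks:
                for hand in result.multi_hand_landmarks:
                    for lm in hand.landmark:
                        vec += [lm.x, lm.y, lm.z]
            # Zero-pad to a fixed width: 2 hands x 21 landmarks x (x, y, z) = 126.
            vec += [0.0] * (126 - len(vec))
            frames.append(vec)
    cap.release()
    return frames  # one 126-dimensional keypoint vector per frame
```

The second stage could then consume these per-frame keypoint vectors with a sequence-to-sequence Transformer, as in the following PyTorch sketch. The class name, feature width, model width, vocabulary size, and the omission of positional encodings are all assumptions made for illustration; this is not the trained ASL model described in the paper.

```python
import torch.nn as nn

class KeypointsToText(nn.Module):
    """Stage 2 (sketch): translate a per-frame keypoint sequence into text tokens."""
    def __init__(self, in_feats=126, d_model=256, vocab_size=64):
        super().__init__()
        self.in_proj = nn.Linear(in_feats, d_model)       # project keypoints to model width
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # embed target text tokens
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.out_proj = nn.Linear(d_model, vocab_size)    # per-position token logits

    def forward(self, keypoints, tokens):
        # keypoints: (batch, frames, in_feats); tokens: (batch, text_len)
        src = self.in_proj(keypoints)
        tgt = self.tok_emb(tokens)
        # Causal mask so each output position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.out_proj(self.transformer(src, tgt, tgt_mask=mask))
```

In practice, a working model would also need positional encodings over the frame and token sequences and autoregressive decoding at inference time, which this sketch omits for brevity.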
Keywords: video analysis of hand movements, gesture recognition, action recognition, deep neural networks, transformer, technological operations
For citation: Shtekhin S.E., Stadnik A.V. Study of deep learning models in the task of recognizing technological operations as a sequence of hand movements. Modeling, Optimization and Information Technology. 2024;12(2). URL: https://moitvivt.ru/ru/journal/pdf?id=1574 DOI: 10.26102/2310-6018/2024.45.2.035 (In Russ.).
Received 06.06.2024
Revised 19.06.2024
Accepted 27.06.2024
Published 30.06.2024