Method of human emotion recognition through analysis of body motor activity in a video stream using neural networks.

Uzdiaev M.Y., Dudarenko D.M., Mironov V.N.

UDC 004.032.26
DOI: 10.26102/2310-6018/2021.32.1.004

Abstract

This paper applies several neural network models to the problem of recognizing human emotions from body motor activity in video stream frames, without complex preprocessing of those frames. The models considered are three-dimensional convolutional neural networks, Inception 3D (I3D) and Residual 3D (R3D), and convolutional-recurrent architectures that combine a convolutional neural network of the ResNet architecture with recurrent neural networks of the LSTM and GRU architectures (ResNet + LSTM, ResNet + GRU). These models do not require preliminary processing of images or the video stream and can potentially achieve high emotion recognition accuracy. Based on these architectures, a method for recognizing human emotions from body motor activity in a video stream is proposed. The paper discusses the architectural features of the models, the ways in which the models process video stream frames, and the recognition results in terms of the following quality metrics: the proportion of correctly recognized instances (accuracy), precision, and recall. Evaluation of the proposed I3D, R3D, ResNet + LSTM, and ResNet + GRU models on the FABO data set showed high-quality emotion recognition from body motor activity. The R3D model achieved the best accuracy, 91%; the I3D, ResNet + LSTM, and ResNet + GRU models achieved 88%, 80%, and 80%, respectively. According to these experimental results, the three-dimensional convolutional models I3D and R3D are the most preferable for recognizing a person's emotional state from motor activity, judged by the combined indicators of emotion classification accuracy.
In addition, unlike most existing solutions, the proposed models make it possible to recognize emotions directly from the RGB frames of a video stream, without resource-intensive preprocessing of those frames, and to do so in real time with high accuracy.
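
As an illustration of the three quality metrics named in the abstract, the following is a minimal sketch of how accuracy, precision, and recall could be computed for a multi-class emotion classifier. It is not the authors' evaluation code: the function name `classification_metrics`, the emotion labels, and the choice of macro averaging over classes are assumptions made for this example, since the abstract does not state how precision and recall were aggregated.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall).

    Macro averaging (unweighted mean over classes) is an assumption of
    this sketch; the abstract does not specify the averaging scheme.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correctly recognized instance of this class
        else:
            fp[pred] += 1    # predicted class received a false positive
            fn[truth] += 1   # true class was missed

    # Accuracy: proportion of correctly recognized instances.
    accuracy = sum(tp.values()) / len(y_true)

    def ratio(num, den):
        # Classes with no predictions (or no instances) contribute 0.0.
        return num / den if den else 0.0

    precision = sum(ratio(tp[c], tp[c] + fp[c]) for c in labels) / len(labels)
    recall = sum(ratio(tp[c], tp[c] + fn[c]) for c in labels) / len(labels)
    return accuracy, precision, recall


# Hypothetical labels, used only to exercise the function.
y_true = ["anger", "anger", "joy", "joy", "fear"]
y_pred = ["anger", "joy", "joy", "joy", "fear"]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Accuracy alone can hide weak performance on rare emotion classes, which is one reason to report per-class precision and recall alongside it.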

Uzdiaev Mikhail Yurievich

Email: m.y.uzdiaev@gmail.com

St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences

Saint-Petersburg, Russian Federation

Dudarenko Dmitry Mikhailovich

St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences

Saint-Petersburg, Russian Federation

Mironov Viktor Nikolaevich

St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences

Saint-Petersburg, Russian Federation

Keywords: neural network model, emotion recognition, convolutional neural networks, machine learning, image processing, video stream

For citation: Uzdiaev M.Y., Dudarenko D.M., Mironov V.N. Method of human emotion recognition through analysis of body motor activity in a video stream using neural networks. Modeling, Optimization and Information Technology. 2021;9(1). Available from: https://moitvivt.ru/ru/journal/pdf?id=929 DOI: 10.26102/2310-6018/2021.32.1.004 (In Russ).

Revised 15.02.2021

Published 24.03.2021