Reward function verification methodology for training locomotion policies of a quadruped robot

Geroyev A.S., Gerget O.M., Bashkirova A.V., Filchenkov A.A.

UDC 004.896
DOI: 10.26102/2310-6018/2026.56.5.003

Abstract

This article proposes an approach to reward function design based on sequential testing of the function's components. An incorrectly implemented or incorrectly weighted component can shift the maximum of the resulting function so that it no longer corresponds to the desired robot behavior. To address this issue and to evaluate the function in advance, a verification method is proposed that systematically checks both the individual reward function components and their weighting coefficients before time-consuming and resource-intensive policy training begins. The method generates a set of desirable and undesirable robot behavior scenarios and evaluates the reward function and its components on them. Testing is performed at two levels. At the first level, the individual components that encode the desired motion criteria, such as maintaining the target speed, body stability, and body height, are tested for a monotonic decrease as the state becomes less desirable. At the second level, the weighted sum of these components is tested to ensure that weight imbalances do not increase the reward during instability, falls, or movement at an undesired speed or in an undesired direction. Particular attention is paid to the desired-state test, a scenario of ideal straight-line motion, which helps identify "incorrect" coefficient sets in which penalizing components dominate even under ideal conditions. Experimental validation was conducted on a Unitree Go1 robot model in the PyBullet environment. The results confirm that the proposed tests identify component implementation errors and weight imbalances, which increases the reliability of the training process and reduces development time.
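The two-level testing idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use PyBullet. The component formulas, the `effort` term, the scenario values, the weights, and the rule that the ideal scenario must score above zero are all assumptions chosen for illustration only.

```python
import math

# Hypothetical reward components (formulas are assumptions, not from the article).
def speed_reward(v, v_target=0.5):
    return math.exp(-((v - v_target) ** 2) / 0.25)

def height_reward(z, z_target=0.30):
    return math.exp(-((z - z_target) ** 2) / 0.01)

def tilt_penalty(roll, pitch):
    return roll ** 2 + pitch ** 2

def total_reward(s, w):
    """Weighted sum of the components for a state dict s and weight dict w."""
    return (w["speed"] * speed_reward(s["v"])
            + w["height"] * height_reward(s["z"])
            - w["tilt"] * tilt_penalty(s["roll"], s["pitch"])
            - w["effort"] * s["effort"])

def is_monotonically_decreasing(values):
    return all(a > b for a, b in zip(values, values[1:]))

# Level 1: each component must decrease as the state moves away from the target.
def check_components():
    deviations = [0.0, 0.05, 0.1, 0.2, 0.4]
    return {
        "speed": is_monotonically_decreasing(
            [speed_reward(0.5 + d) for d in deviations]),
        "height": is_monotonically_decreasing(
            [height_reward(0.30 - d) for d in deviations]),
        "tilt": is_monotonically_decreasing(
            [-tilt_penalty(d, d) for d in deviations]),
    }

# Level 2: hand-written scenarios (values are illustrative assumptions).
IDEAL = {"v": 0.5, "z": 0.30, "roll": 0.0, "pitch": 0.0, "effort": 1.0}
UNDESIRED = {
    "fallen": {"v": 0.0, "z": 0.08, "roll": 1.2, "pitch": 0.4, "effort": 0.2},
    "backwards": {"v": -0.5, "z": 0.30, "roll": 0.0, "pitch": 0.0, "effort": 1.0},
    "standing_still": {"v": 0.0, "z": 0.30, "roll": 0.0, "pitch": 0.0, "effort": 0.1},
}

def check_weights(w):
    """Return the names of failed checks for the weight set w."""
    failures = []
    ideal = total_reward(IDEAL, w)
    # Assumed criterion: penalties must not dominate in the ideal scenario.
    if ideal <= 0:
        failures.append("ideal_not_positive")
    # No undesired scenario may score at least as high as the ideal one.
    for name, state in UNDESIRED.items():
        if total_reward(state, w) >= ideal:
            failures.append(name)
    return failures

balanced = {"speed": 1.0, "height": 0.5, "tilt": 1.0, "effort": 0.05}
imbalanced = {"speed": 1.0, "height": 0.5, "tilt": 1.0, "effort": 2.0}

print(check_components())
print(check_weights(balanced))    # []
print(check_weights(imbalanced))  # ['ideal_not_positive', 'standing_still']
```

In this sketch the oversized `effort` weight makes standing still more rewarding than ideal walking. A training run with such weights would likely converge to a stationary policy, and the check flags this before any training takes place.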


Geroyev Alexander Sergeevich


V.A. Trapeznikov Institute of Control Sciences of Russian Academy of Sciences
Applied Robotics LLC

Moscow, Russian Federation

Gerget Olga Mikhailovna
Doctor of Engineering Sciences, Docent


V.A. Trapeznikov Institute of Control Sciences of Russian Academy of Sciences

Moscow, Russian Federation

Bashkirova Anastasiia Viacheslavovna

V.A. Trapeznikov Institute of Control Sciences of Russian Academy of Sciences

Moscow, Russian Federation

Filchenkov Alexander Alexandrovich

Moscow Polytechnic University
Applied Robotics LLC

Moscow, Russian Federation

Keywords: reinforcement learning, environment of a quadruped robot, intelligent agent, state space, action space, reward function, locomotion

For citation: Geroyev A.S., Gerget O.M., Bashkirova A.V., Filchenkov A.A. Reward function verification methodology for training locomotion policies of a quadruped robot. Modeling, Optimization and Information Technology. 2026;14(5). URL: https://moitvivt.ru/ru/journal/article?id=2272 DOI: 10.26102/2310-6018/2026.56.5.003 (In Russ).


Received 06.03.2026

Revised 27.04.2026

Accepted 11.05.2026

Published 31.05.2026