moitvivt

Моделирование, оптимизация и информационные технологии

Modeling, Optimization and Information Technology

2310-6018

Издательство

10.26102/2310-6018/2026.56.5.003

2272

Методика верификации функции вознаграждения для обучения политик локомоции четвероногого робота

Reward function verification methodology for training locomotion policies of a quadruped robot

0009-0000-1280-4709

Героев

Александр Сергеевич

Geroyev

Alexander Sergeevich

geroev_sasha@mail.ru aff-1

0000-0002-6242-9502

Гергет

Ольга Михайловна

Gerget

Olga Mikhailovna

olgagerget@mail.ru aff-2

Башкирова

Анастасия Вячеславовна

Bashkirova

Anastasiia Viacheslavovna

basana235@yandex.ru aff-3

Фильченков

Александр Александрович

Filchenkov

Alexander Alexandrovich

al.filchenkov@gmail.com aff-4

Институт проблем управления имени В.А. Трапезникова РАН; ООО "ПРИКЛАДНАЯ РОБОТОТЕХНИКА" / V.A. Trapeznikov Institute of Control Sciences of Russian Academy of Sciences; Applied Robotics LLC

Институт проблем управления имени В.А. Трапезникова РАН / V.A. Trapeznikov Institute of Control Sciences of Russian Academy of Sciences

Московский политехнический университет; ООО "ПРИКЛАДНАЯ РОБОТОТЕХНИКА" / Moscow Polytechnic University; Applied Robotics LLC

01 01 2026

1 1

2026

This work is licensed under a Creative Commons Attribution 4.0 International License

В статье предложен подход к моделированию функции вознаграждения путем последовательного тестирования ее функциональных компонент. Некорректные функциональные компоненты могут привести к тому, что максимальное значение результирующей функции перестанет соответствовать желаемому целевому поведению робота. Для решения этой проблемы, а также предварительной оценки самой функции была предложена методика верификации, позволяющая проводить систематическую проверку как отдельных компонент функции вознаграждения, так и их весовых коэффициентов до начала длительного и ресурсоемкого обучения политики. Методика включает в себя формирование набора желательных и нежелательных сценариев поведения робота для последующей оценки изменения функции вознаграждения и ее функциональных компонент. Предложен двухуровневый метод тестирования: на первом уровне тестируются отдельные функциональные компоненты, отвечающие за соблюдение желаемых критериев движения робота, таких как сохранение целевой скорости, сохранение целевой устойчивости корпуса, сохранение целевой высоты корпуса и т. д., на предмет их монотонного убывания в нежелательных состояниях. На втором уровне тестируется результирующая функция взвешенной суммы этих компонент, чтобы убедиться, что дисбаланс весов не приводит к росту награды при потере устойчивости, падении или движении с нежелательной скоростью в нежелательном направлении. Особое внимание уделяется тесту на соответствие желательному состоянию – сценарию идеального прямолинейного движения, который позволяет выявить «некорректные» наборы коэффициентов, при которых штрафующие компоненты доминируют даже в идеальных условиях. Экспериментальная проверка проведена на модели робота Unitree Go1 в среде PyBullet. Результаты подтверждают, что предложенные тесты эффективно выявляют ошибки в реализации компонент и дисбаланс весов, что существенно повышает надежность процесса обучения и сокращает временные затраты на разработку.

This article proposes an approach to modeling the reward function through sequential testing of its functional components. Incorrect functional components can cause the maximum of the resulting function to no longer correspond to the desired robot behavior. To address this problem and to evaluate the function in advance, a verification methodology is proposed that systematically checks both the individual reward function components and their weighting coefficients before long and resource-intensive policy training begins. The methodology involves constructing a set of desirable and undesirable robot behavior scenarios and then evaluating how the reward function and its functional components change across them. A two-level testing method is proposed. At the first level, the individual functional components responsible for the desired motion criteria (such as maintaining the target speed, the target body stability, and the target body height) are tested for monotonic decrease in undesirable states. At the second level, the resulting weighted sum of these components is tested to ensure that a weight imbalance does not cause the reward to grow when the robot loses stability, falls, or moves at an undesirable speed in an undesirable direction. Particular attention is paid to the desired-state test, a scenario of ideal straight-line motion, which reveals "incorrect" sets of coefficients for which the penalizing components dominate even under ideal conditions. Experimental validation was carried out on a model of the Unitree Go1 robot in the PyBullet environment. The results confirm that the proposed tests effectively detect errors in component implementation and weight imbalances, which significantly increases the reliability of the training process and reduces development time.
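The two-level scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the state fields (`vx`, `z`, `roll`, `effort`), the component functions, the target values, the weights, and the hand-written scenarios are all hypothetical, and the states are listed directly rather than sampled from a PyBullet simulation.

```python
import math
from typing import Callable, Dict, List

State = Dict[str, float]  # hypothetical state: forward speed, body height, roll, actuation effort

# Hypothetical desired state: ideal straight-line motion at the target speed and height.
IDEAL: State = {"vx": 0.5, "z": 0.3, "roll": 0.0, "effort": 0.2}

# Hypothetical components; the first two peak at 1.0 on target, the last two are penalties.
COMPONENTS: Dict[str, Callable[[State], float]] = {
    "velocity": lambda s: math.exp(-((s["vx"] - IDEAL["vx"]) ** 2) / 0.1),
    "height": lambda s: math.exp(-((s["z"] - IDEAL["z"]) ** 2) / 0.01),
    "roll": lambda s: -abs(s["roll"]),
    "effort": lambda s: -s["effort"],
}

def scenario(**series: List[float]) -> List[State]:
    """Build a state sequence that starts from IDEAL and varies only the given fields."""
    length = len(next(iter(series.values())))
    return [{**IDEAL, **{k: v[i] for k, v in series.items()}} for i in range(length)]

# Undesirable scenarios, each ordered from the desired state towards a worse one.
UNDESIRABLE: Dict[str, List[State]] = {
    "slowing_down": scenario(vx=[0.5, 0.4, 0.2, 0.0], effort=[0.2, 0.16, 0.08, 0.0]),
    "body_sinking": scenario(z=[0.30, 0.25, 0.20, 0.10]),
    "rolling_over": scenario(roll=[0.0, 0.2, 0.6, 1.2]),
}

# Which component is expected to react to which scenario (level 1).
PAIRS = [("velocity", "slowing_down"), ("height", "body_sinking"), ("roll", "rolling_over")]

def is_non_increasing(values: List[float], tol: float = 1e-9) -> bool:
    return all(b <= a + tol for a, b in zip(values, values[1:]))

def total_reward(s: State, weights: Dict[str, float]) -> float:
    return sum(weights[name] * fn(s) for name, fn in COMPONENTS.items())

def level1_failures(components: Dict[str, Callable[[State], float]]) -> List[str]:
    """Level 1: names of components that do not decrease along their undesirable scenario."""
    return [name for name, sc in PAIRS
            if not is_non_increasing([components[name](s) for s in UNDESIRABLE[sc]])]

def level2_failures(weights: Dict[str, float]) -> List[str]:
    """Level 2: scenarios along which the weighted sum grows as behavior gets worse."""
    return [sc for sc, states in UNDESIRABLE.items()
            if not is_non_increasing([total_reward(s, weights) for s in states])]

def ideal_state_ok(weights: Dict[str, float]) -> bool:
    """Desired-state test: penalties must not outweigh rewards in ideal motion."""
    return total_reward(IDEAL, weights) > 0

balanced = {"velocity": 1.0, "height": 0.5, "roll": 0.5, "effort": 0.1}
imbalanced = {"velocity": 0.1, "height": 0.5, "roll": 0.5, "effort": 5.0}
buggy = {**COMPONENTS, "roll": lambda s: abs(s["roll"])}  # sign error in the roll penalty

print(level1_failures(COMPONENTS), level1_failures(buggy))
print(level2_failures(balanced), level2_failures(imbalanced))
print(ideal_state_ok(balanced), ideal_state_ok(imbalanced))
```

In this sketch the sign error in the roll term is caught at the first level, while the overweighted effort penalty is caught only at the second level and by the desired-state test: with those weights the robot would be rewarded for slowing down, and ideal motion would yield a negative total reward.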

обучение с подкреплением; окружение четвероногого робота; интеллектуальный агент; пространство состояний; пространство действий; функция вознаграждения; локомоция

reinforcement learning; environment of a quadruped robot; intelligent agent; state space; action space; reward function; locomotion

Исследование выполнено без спонсорской поддержки.

The study was performed without external funding.

References

Schulman J., Wolski F., Dhariwal P., Radford A., Klimov O. Proximal policy optimization algorithms. arXiv. URL: https://arxiv.org/abs/1707.06347 [Accessed 5th February 2026].

Tobin J., Fong R., Ray A., et al. Domain randomization for transferring deep neural networks from simulation to the real world. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 24–28 September 2017, Vancouver, BC, Canada. IEEE; 2017. P. 23–30. https://doi.org/10.1109/IROS.2017.8202133

Muratore F., Gienger M., Peters J. Assessing transferability from simulation to reality for reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021;43(4):1172–1183. https://doi.org/10.1109/TPAMI.2019.2952353

Ma Y.J., Liang W., Wang H.-J., et al. DrEureka: Language Model Guided Sim-To-Real Transfer. In: Robotics: Science and Systems 2024, 15–19 July 2024, Delft, The Netherlands. 2024. https://doi.org/10.15607/RSS.2024.XX.094

Kim M.-S., Kim J.-S., Park J.-H. Automated Hyperparameter Tuning in Reinforcement Learning for Quadrupedal Robot Locomotion. Electronics. 2024;13(1). https://doi.org/10.3390/electronics13010116

Hwangbo J., Lee J., Dosovitskiy A., et al. Learning agile and dynamic motor skills for legged robots. Science Robotics. 2019;4(26). https://doi.org/10.1126/scirobotics.aau5872

Bellegarda G., Chen Y., Liu Zh., Nguyen Q. Robust High-speed Running for Quadruped Robots via Deep Reinforcement Learning. arXiv. URL: https://arxiv.org/abs/2103.06484 [Accessed 12th February 2026].

Zhao Y., Wu T., Zhu Y., et al. ZSL-RPPO: Zero-Shot Learning for Quadrupedal Locomotion in Challenging Terrains using Recurrent Proximal Policy Optimization. arXiv. URL: https://arxiv.org/abs/2403.01928 [Accessed 5th February 2026].

Van Marum B., Shrestha A., Duan H., et al. Revisiting Reward Design and Evaluation for Robust Humanoid Standing and Walking. arXiv. URL: https://arxiv.org/abs/2404.19173 [Accessed 10th February 2026].

Soni R., Harnack D., Isermann H., et al. End-to-End Reinforcement Learning for Torque Based Variable Height Hopping. In: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 01–05 October 2023, Detroit, MI, USA. IEEE; 2023. P. 7531–7538. https://doi.org/10.1109/IROS55552.2023.10342187

The authors declare no conflicts of interest.