Control of Wheeled Bipedal Robots Based on Deep Reinforcement Learning
Reinforcement Learning Algorithm Benchmarking and Core Architecture Optimization
These works focus on comparing the performance of different deep reinforcement learning algorithms (e.g., SAC, PPO, DDPG, TD3) for bipedal robot control, on hyperparameter optimization, on action-space design (torque vs. position control), and on improvements to the core training architecture (e.g., parallel computation, experience replay, and memory-augmented networks such as LSTM-SAC). A minimal benchmarking sketch follows the reference list below.
- Human Walking Computational Models using Reinforcement Learning(Jayalakshmi Murugan, K. Maharajan, M. Vijay, S. Selvi, 2024, 2024 5th International Conference on Innovative Trends in Information Technology (ICITIIT))
- Teaching a Robot to Walk Using Reinforcement Learning(Jack Dibachi, Jacob Azoulay, 2021, ArXiv)
- Learning Multi-Skill Locomotion in Underactuated Biped: A Waypoint-Based Reward Shaping Approach(Jagannath Prasad Sahoo, Surya Prakash S.K., D. Prajapati, K. Pant, Abhay Dwivedi, Amit Shukla, 2025, 2025 Eleventh Indian Control Conference (ICC))
- Comparative Analysis of Reinforcement Learning Algorithms for Bipedal Robot Locomotion(O. Aydogmus, Musa Yilmaz, 2024, IEEE Access)
- Parallel Deep Reinforcement Learning Method for Gait Control of Biped Robot(C. Tao, J. Xue, Zufeng Zhang, Zhen Gao, 2022, IEEE Transactions on Circuits and Systems II: Express Briefs)
- M-A3C: A Mean-Asynchronous Advantage Actor-Critic Reinforcement Learning Method for Real-Time Gait Planning of Biped Robot(Jie Leng, Suozhong Fan, Jun Tang, Haiming Mou, Junxiao Xue, Qingdu Li, 2022, IEEE Access)
- A Reinforcement Learning Method for Humanoid Robot Walking(Yunda Liu, Sheng Bi, Min Dong, Yingjie Zhang, Jialing Huang, Jiawei Zhang, 2018, 2018 IEEE 8th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER))
- Understanding the stability of deep control policies for biped locomotion(Hwangpil Park, R. Yu, Yoonsang Lee, Kyungho Lee, Jehee Lee, 2020, The Visual Computer)
- Impact of Control Frequency on Deep RL-Based Torque Controller for Bipedal Locomotion(Junhyeok Cha, Donghyeon Kim, Jaeheung Park, 2023, No journal)
- A Multiobjective Collaborative Deep Reinforcement Learning Algorithm for Jumping Optimization of Bipedal Robot(C. Tao, Mengru Li, Feng Cao, Zhen Gao, Zufeng Zhang, 2023, Advanced Intelligent Systems)
- Distillation-PPO: A Novel Two-Stage Reinforcement Learning Framework for Humanoid Robot Perceptive Locomotion(Qiang Zhang, Gang Han, Jingkai Sun, Wen Zhao, Chenghao Sun, Jiahang Cao, Jiaxu Wang, Yijie Guo, Renjing Xu, 2025, 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- Reward-Adaptive Reinforcement Learning: Dynamic Policy Gradient Optimization for Bipedal Locomotion(Changxin Huang, Guangrun Wang, Zhibo Zhou, Ronghui Zhang, Liang Lin, 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence)
- Memory-based Deep Reinforcement Learning for Humanoid Locomotion under Noisy Scenarios(Samuel Chenatti, E. Colombini, 2022, 2022 Latin American Robotics Symposium (LARS), 2022 Brazilian Symposium on Robotics (SBR), and 2022 Workshop on Robotics in Education (WRE))
- Biped Robot Walking based on Deep Reinforcement Learning(Tomislav Tadić, Petar Curkovic, 2023, 2023 46th MIPRO ICT and Electronics Convention (MIPRO))
- Deep Reinforcement Learning for Humanoid Robot Behaviors(Alex Muzio, M. Maximo, Takashi Yoneyama, 2022, Journal of Intelligent & Robotic Systems)
- Hybrid and dynamic policy gradient optimization for bipedal robot locomotion(Changxin Huang, Jiang Su, Zhihong Zhang, Dong Zhao, Liang Lin, 2021, ArXiv)
- Comparison of Reinforcement Learning Algorithms for Motion Control of an Autonomous Robot in Gazebo Simulator(Daniil A. Kozlov, 2021, 2021 International Conference on Information Technology and Nanotechnology (ITNT))
- A parallel heterogeneous policy deep reinforcement learning algorithm for bipedal walking motion design(Chunguang Li, Mengru Li, C. Tao, 2023, Frontiers in Neurorobotics)
- Bipedal Walking Robot using Deep Deterministic Policy Gradient(Arun Kumar, Navneet Paul, S. Omkar, 2018, ArXiv)
- Sequential sensor fusion-based W-DDPG gait controller of bipedal robots for adaptive slope walking(Ping-Huan Kuo, Jun Hu, Kuan-Lin Chen, W. Chang, Xin-Yu Chen, Chiou-Jye Huang, 2023, Adv. Eng. Informatics)
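To make the algorithm-comparison theme concrete, here is a minimal, hedged benchmarking sketch. It is not taken from any of the cited papers; it assumes stable-baselines3 and gymnasium (with Box2D) are installed, uses the standard BipedalWalker-v3 task, and trains with a deliberately short budget for illustration only.

```python
# Minimal sketch: compare PPO, SAC, and TD3 on BipedalWalker-v3.
# Assumes `pip install stable-baselines3 gymnasium[box2d]`.
import gymnasium as gym
from stable_baselines3 import PPO, SAC, TD3
from stable_baselines3.common.evaluation import evaluate_policy

ALGOS = {"PPO": PPO, "SAC": SAC, "TD3": TD3}

results = {}
for name, Algo in ALGOS.items():
    env = gym.make("BipedalWalker-v3")
    # "MlpPolicy" uses a small fully connected actor-critic network.
    model = Algo("MlpPolicy", env, verbose=0, seed=0)
    model.learn(total_timesteps=100_000)   # short budget, illustration only
    mean_r, std_r = evaluate_policy(model, env, n_eval_episodes=10)
    results[name] = (mean_r, std_r)
    env.close()

for name, (mean_r, std_r) in results.items():
    print(f"{name}: mean return {mean_r:.1f} +/- {std_r:.1f}")
```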
Hybrid Control: Model-Based Methods (MPC/WBC/HZD) Combined with Reinforcement Learning
This group of papers explores how to combine classical robot control models (e.g., the linear inverted pendulum model (LIPM), model predictive control (MPC), whole-body control (WBC), and hybrid zero dynamics (HZD)) with reinforcement learning, using physical priors to guide the learning process and to improve the interpretability, precision, and dynamic stability of control. A sketch of the common model-nominal-plus-learned-residual pattern follows the reference list below.
- RL-augmented MPC Framework for Agile and Robust Bipedal Footstep Locomotion Planning and Control(S. Bang, Carlos Arribalzaga Jové, Luis Sentis, 2024, 2024 IEEE-RAS 23rd International Conference on Humanoid Robots (Humanoids))
- Integrating Model-Based Footstep Planning with Model-Free Reinforcement Learning for Dynamic Legged Locomotion(H. Lee, Seungwoo Hong, Sangbae Kim, 2024, 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- CLF-RL: Control Lyapunov Function Guided Reinforcement Learning(Kejun Li, Zachary Olkin, Yisong Yue, Aaron D. Ames, 2025, IEEE Robotics and Automation Letters)
- Template Model Inspired Task Space Learning for Robust Bipedal Locomotion(Guillermo A. Castillo, Bowen Weng, Shunpeng Yang, Wei Zhang, Ayonga Hereid, 2023, 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- LIPM-Guided Reinforcement Learning for Stable and Perceptive Locomotion in Bipedal Robots(H. Su, Haoxiang Luo, Shunpeng Yang, Kaiwen Jiang, Wei Zhang, Hua Chen, 2025, 2025 IEEE-RAS 24th International Conference on Humanoid Robots (Humanoids))
- Rambo: RL-Augmented Model-Based Whole-Body Control for Loco-Manipulation(Jin Cheng, Dongho Kang, Gabriele Fadini, Guanya Shi, Stelian Coros, 2025, IEEE Robotics and Automation Letters)
- Extended Hybrid Zero Dynamics for Bipedal Walking of the Knee-less Robot SLIDER(Rui Zong, M. Liang, Yu Fang, Ke Wang, Xiaoshuai Chen, Wei Chen, Petar Kormushev, 2025, ArXiv)
- Fusing Dynamics and Reinforcement Learning for Control Strategy: Achieving Precise Gait and High Robustness in Humanoid Robot Locomotion*(Zida Zhao, Haodong Huang, Shilong Sun, Chiyao Li, Wenfu Xu, 2024, 2024 IEEE-RAS 23rd International Conference on Humanoid Robots (Humanoids))
- RL-augmented Adaptive Model Predictive Control for Bipedal Locomotion over Challenging Terrain(Junnosuke Kamohara, Feiyang Wu, Chinmayee Wamorkar, Seth Hutchinson, Ye Zhao, 2025, ArXiv)
- Reinforcement Learning-Based Parameter Optimization for Whole-Body Admittance Control with IS-MPC(Nícolas F. Figueroa, Julio Tafur, A. Kheddar, 2024, 2024 IEEE/SICE International Symposium on System Integration (SII))
- Planning and Execution of Dynamic Whole-body Locomotion for a Wheeled Biped Robot on Uneven Terrain(Yaxian Xin, Yibin Li, Hui Chai, Xuewen Rong, Jiuhong Ruan, 2024, International Journal of Control, Automation and Systems)
- A learning-based model predictive control scheme and its application in biped locomotion(Jingchao Li, Zhaohui Yuan, Sheng Dong, Xiaoyue Sang, Jian Kang, 2022, Eng. Appl. Artif. Intell.)
- An Adaptive Approach to Whole-Body Balance Control of Wheel-Bipedal Robot Ollie(Jingfan Zhang, Shuai Wang, Haitao Wang, Jie Lai, Zhenshan Bing, Yu Jiang, Yu Zheng, Zhengyou Zhang, 2022, 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- Research on Whole-Body Coordinated Motion of Humanoid Robots Based on LSTM-Integrated Reinforcement Learning(Tianyu Yuan, Chaoyi Dong, Ge Tai, Shuai Xiang, Haoda Yan, Zhifeng Kong, Chenzhe Zhang, Xiaoyan Chen, 2025, 2025 11th International Conference on Control, Decision and Information Technologies (CoDIT))
- NaviGait: Navigating Dynamically Feasible Gait Libraries using Deep Reinforcement Learning(Neil C. Janwani, Varun Madabushi, Maegan Tucker, 2025, ArXiv)
- Robust Feedback Motion Policy Design Using Reinforcement Learning on a 3D Digit Bipedal Robot(Guillermo A. Castillo, Bowen Weng, Wei Zhang, Ayonga Hereid, 2021, 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- Optimizing Bipedal Maneuvers of Single Rigid-Body Models for Reinforcement Learning(Ryan Batke, Fang-Yu Yu, Jeremy Dao, J. Hurst, R. Hatton, Alan Fern, Kevin R. Green, 2022, 2022 IEEE-RAS 21st International Conference on Humanoid Robots (Humanoids))
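The sketch below illustrates the "model nominal + learned residual" pattern shared by several papers above: a LIPM capture point supplies a physically grounded foot placement, and a learned policy only corrects it. This is a minimal illustration under stated assumptions (the `policy` callable, the residual bound, and all numbers are hypothetical), not code from any cited paper.

```python
# LIPM capture point as a physical prior, plus a bounded learned residual.
import math

G = 9.81  # gravitational acceleration (m/s^2)

def capture_point(com_x: float, com_xd: float, com_height: float) -> float:
    """Instantaneous capture point of the LIPM: x_cp = x + xd / omega."""
    omega = math.sqrt(G / com_height)   # LIPM natural frequency
    return com_x + com_xd / omega

def foot_target(com_x, com_xd, com_height, policy, obs, residual_limit=0.05):
    """Nominal LIPM placement plus a bounded learned residual (meters)."""
    nominal = capture_point(com_x, com_xd, com_height)
    residual = max(-residual_limit, min(residual_limit, policy(obs)))
    return nominal + residual

# Usage with a trivial stand-in policy (a real one would be a trained network):
print(foot_target(0.0, 0.4, 0.8, policy=lambda obs: 0.02, obs=None))
```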
Sim-to-Real Transfer and Disturbance-Rejection Robustness
These papers focus on closing the simulation-to-reality gap through domain randomization, system identification, actuator-dynamics modeling, and dedicated anti-disturbance training (e.g., push recovery, capture-point theory, adaptation to wind loads), so that learned policies deploy reliably and balance robustly on physical robots. A minimal domain-randomization sketch follows the reference list below.
- Sim-to-Real Transfer of Compliant Bipedal Locomotion on Torque Sensor-Less Gear-Driven Humanoid(Shimpei Masuda, K. Takahashi, 2022, 2023 IEEE-RAS 22nd International Conference on Humanoid Robots (Humanoids))
- Learning Bipedal Locomotion on Gear-Driven Humanoid Robot Using Foot-Mounted IMUs(Sotaro Katayama, Yuta Koda, Norio Nagatsuka, Masaya Kinoshita, 2025, 2025 IEEE-RAS 24th International Conference on Humanoid Robots (Humanoids))
- Robust Humanoid Walking on Compliant and Uneven Terrain with Deep Reinforcement Learning(Rohan P. Singh, M. Morisawa, M. Benallegue, Zhaoming Xie, F. Kanehiro, 2024, 2024 IEEE-RAS 23rd International Conference on Humanoid Robots (Humanoids))
- Safe and Efficient Auto-tuning to Cross Sim-to-real Gap for Bipedal Robot(Yidong Du, Xuechao Chen, Zhangguo Yu, Yuanxi Zhang, Zishun Zhou, Jindai Zhang, Jintao Zhang, Botao Liu, Qiang Huang, 2024, 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- Motor-Guided Randomization Reinforcement Learning in Humanoid Robots Walking on Terrain Locomotion(Yuhan Gong, Yanzhen Li, Xiangrong Sun, Lihao Jia, 2025, 2025 25th International Conference on Control, Automation and Systems (ICCAS))
- Towards Real Robot Learning in the Wild: A Case Study in Bipedal Locomotion(Michael Bloesch, Jan Humplik, Viorica Patraucean, Roland Hafner, Tuomas Haarnoja, Arunkumar Byravan, Noah Siegel, S. Tunyasuvunakool, Federico Casarini, Nathan Batchelor, Francesco Romano, Stefano Saliceti, Martin A. Riedmiller, S. Eslami, N. Heess, 2021, No journal)
- Biped Robots Control in Gusty Environments with Adaptive Exploration Based DDPG(Yilin Zhang, Huimin Sun, Honglin Sun, Yuan Huang, Kenji Hashimoto, 2024, Biomimetics)
- DDPG Reinforcement Learning Experiment for Improving the Stability of Bipedal Walking of Humanoid Robots(Yeonghun Chun, Junghun Choi, Injoon Min, M. Ahn, Jeakweon Han, 2023, 2023 IEEE/SICE International Symposium on System Integration (SII))
- Sim-to-Real Learning for Humanoid Box Loco-Manipulation(Jeremy Dao, Helei Duan, Alan Fern, 2023, 2024 IEEE International Conference on Robotics and Automation (ICRA))
- Booster Gym: An End-to-End Reinforcement Learning Framework for Humanoid Robot Locomotion(Yushi Wang, Penghui Chen, Xinyu Han, Feng Wu, Mingguo Zhao, 2025, ArXiv)
- Online Behavior-Centric Adaptation for Bipedal Robot Sim-to-Real Transfer With Unmodeled Dynamics Mismatch(Xuechao Chen, Yidong Du, Zishun Zhou, Zhicheng Yuan, Qingrui Zhao, Fei Meng, Zhangguo Yu, Peng-Chen Lu, Qiang Huang, 2026, IEEE Transactions on Automation Science and Engineering)
- Sim-to-Real Transfer in Deep Reinforcement Learning for Bipedal Locomotion(Lingfan Bao, Tianhu Peng, Chengxu Zhou, 2025, ArXiv)
- Bridging the Reality Gap: Analyzing Sim-to-Real Transfer Techniques for Reinforcement Learning in Humanoid Bipedal Locomotion(Donghyeon Kim, Hokyun Lee, Junhyeok Cha, Jaeheung Park, 2025, IEEE Robotics & Automation Magazine)
- Learning Push Recovery Strategies for Humanoid Robots Using Deep Reinforcement Learning(Kristina Savevska, A. Ude, 2025, 2025 IEEE International Conference on Advanced Robotics (ICAR))
- Push Recovery Control for a Biped Robot Using DDPG Reinforcement Learning algorithm(Erfan Salman Mohajer, M. S. Noorani, 2024, 2024 10th International Conference on Artificial Intelligence and Robotics (QICAR))
- Push Recovery Control for Humanoid Robot Using Reinforcement Learning(Harin Kim, Donghyeon Seo, Donghan Kim, 2019, 2019 Third IEEE International Conference on Robotic Computing (IRC))
- Research on Disturbance Immunity Balance Control of Biped Robot Based on ESO-DDPG(Jianjun Yu, R. Zhu, Daoxiong Gong, Naigong Yu, Miaoqiang Zhou, 2023, 2023 9th International Conference on Information, Cybernetics, and Computational Social Systems (ICCSS))
- Constrained Reinforcement Learning for Unstable Point-Feet Bipedal Locomotion Applied to the Bolt Robot(Constant Roux, Elliot Chane-Sane, Ludovic De Matteïs, T. Flayols, Jérôme Manhes, O. Stasse, P. Souères, 2025, 2025 IEEE-RAS 24th International Conference on Humanoid Robots (Humanoids))
- Dynamic Bipedal Turning through Sim-to-Real Reinforcement Learning(Fangzhou Yu, Ryan Batke, Jeremy Dao, J. Hurst, Kevin R. Green, Alan Fern, 2022, 2022 IEEE-RAS 21st International Conference on Humanoid Robots (Humanoids))
- Dynamic Bipedal Maneuvers through Sim-to-Real Reinforcement Learning(Fangzhou Yu, Ryan Batke, Jeremy Dao, J. Hurst, Kevin R. Green, Alan Fern, 2022, ArXiv)
- Robust RL Control for Bipedal Locomotion with Closed Kinematic Chains(Egor Maslennikov, Eduard Zaliaev, Nikita Dudorov, Oleg Shamanin, Karanov Dmitry, Gleb Afanasev, Alexey Burkov, Egor Lygin, Simeon Nedelchev, Evgeny Ponomarev, 2025, ArXiv)
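A minimal domain-randomization sketch, assuming a gymnasium-style environment: at every reset, an actuation latency is resampled and Gaussian observation noise is added, mimicking real sensing and actuation imperfections. Randomizing masses or friction would additionally require access to the simulator's internals; all ranges here are illustrative assumptions, not a cited paper's recipe.

```python
# Episode-wise randomization of actuation latency plus observation noise.
import collections
import numpy as np
import gymnasium as gym

class DomainRandomizationWrapper(gym.Wrapper):
    def __init__(self, env, latency_steps=(0, 2), obs_noise_std=0.01):
        super().__init__(env)
        self.latency_steps = latency_steps
        self.obs_noise_std = obs_noise_std
        self.action_buffer = collections.deque()

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        # Resample an actuation delay (in control steps) for this episode.
        delay = np.random.randint(self.latency_steps[0], self.latency_steps[1] + 1)
        zero_action = np.zeros(self.env.action_space.shape)
        self.action_buffer = collections.deque([zero_action] * delay)
        return self._noisy(obs), info

    def step(self, action):
        self.action_buffer.append(action)
        delayed = self.action_buffer.popleft()  # oldest queued action is applied
        obs, reward, terminated, truncated, info = self.env.step(delayed)
        return self._noisy(obs), reward, terminated, truncated, info

    def _noisy(self, obs):
        return obs + np.random.normal(0.0, self.obs_noise_std, size=obs.shape)

env = DomainRandomizationWrapper(gym.make("BipedalWalker-v3"))
```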
Bio-Inspired Gaits, Imitation Learning, and Motion Generation
These works use human motion-capture data (mocap), physical symmetry constraints, or passive-dynamics models to guide RL toward natural, energy-efficient, and biomechanically plausible gaits, covering imitation and parameterized generation of skills ranging from walking and running to complex soccer maneuvers. A minimal imitation-reward sketch follows the reference list below.
- Human Imitated Bipedal Locomotion with Frequency Based Gait Generator Network(Yusuf Baran Ates, Ömer Morgül, 2025, ArXiv)
- Leveraging Symmetry in RL-based Legged Locomotion Control(Zhi Su, Xiaoyu Huang, Daniel Ordonez-Apraez, Yunfei Li, Zhongyu Li, Qiayuan Liao, Giulio Turrisi, Massimiliano Pontil, Claudio Semini, Yi Wu, K. Sreenath, 2024, 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- Multimodal bipedal locomotion generation with passive dynamics via deep reinforcement learning(Shunsuke Koseki, Kyo Kutsuzawa, D. Owaki, M. Hayashibe, 2023, Frontiers in Neurorobotics)
- Gait Learning for 3D Bipedal Robots Based on a Combined Strategy of Hybrid Zero Dynamics Feedback Control and Periodic Reward(Lin Cui, Tian-Qi Deng, Lihua Ma, Wenhao He, 2024, 2024 International Conference on Electrical Power Systems and Intelligent Control (EPSIC))
- Learning Virtual Passive Dynamic Walking Using the Kneed Walker Model for Guiding Policies(Cong-Thanh Vu, Chi-Cheng Lai, Yen-Chen Liu, 2025, 2025 IEEE-RAS 24th International Conference on Humanoid Robots (Humanoids))
- Walk2Run: A Bio-Rhythm-Inspired Unified Control Framework for Humanoid Robot Walking and Running(Teng Zhang, Xiangji Wang, Guanqun Chen, Fucheng Liu, Fu-Yuan Zha, Wei Guo, 2025, Journal of Bionic Engineering)
- Data-driven gait model for bipedal locomotion over continuous changing speeds and inclines(Bharat Singh, Suchit Patel, Ankit Vijayvargiya, Rajesh Kumar, 2023, Autonomous Robots)
- Learning symmetric and low-energy locomotion(Wenhao Yu, Greg Turk, C. Liu, 2018, ACM Transactions on Graphics (TOG))
- Sim-to-Real Learning of All Common Bipedal Gaits via Periodic Reward Composition(Jonah Siekmann, Yesh Godse, Alan Fern, J. Hurst, 2020, 2021 IEEE International Conference on Robotics and Automation (ICRA))
- Adaptive Mimic: Deep Reinforcement Learning of Parameterized Bipedal Walking from Infeasible References(Chong Zhang, Qi Wu, Liqian Ma, Hongyuan Su, 2021, ArXiv)
- Learning Human-to-Humanoid Real-Time Whole-Body Teleoperation(Tairan He, Zhengyi Luo, Wenli Xiao, Chong Zhang, Kris Kitani, Changliu Liu, Guanya Shi, 2024, 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- Embrace Collisions: Humanoid Shadowing for Deployable Contact-Agnostics Motions(Ziwen Zhuang, Hang Zhao, 2025, ArXiv)
- Learning agile soccer skills for a bipedal robot with deep reinforcement learning(Tuomas Haarnoja, Ben Moran, Guy Lever, Sandy H. Huang, Dhruva Tirumala, Markus Wulfmeier, Jan Humplik, S. Tunyasuvunakool, Noah Siegel, Roland Hafner, Michael Bloesch, Kristian Hartikainen, Arunkumar Byravan, Leonard Hasenclever, Yuval Tassa, Fereshteh Sadeghi, Nathan Batchelor, Federico Casarini, Stefano Saliceti, Charles Game, Neil Sreendra, Kushal Patel, Marlon Gwira, Andrea Huber, N. Hurley, F. Nori, R. Hadsell, N. Heess, 2023, Science Robotics)
- Human sensory-musculoskeletal modeling and control of whole-body movements(Chenhui Zuo, Guohao Lin, Chen Zhang, Shanning Zhuang, Yanan Sui, 2025, ArXiv)
- Adaptive motor patterns and reflexes for bipedal locomotion on rough terrain(Qi Liu, Jie Zhao, Steffen Schütz, K. Berns, 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
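A minimal sketch of a mocap-style imitation reward and a left-right mirroring helper, in the spirit of the imitation and symmetry papers above. The weights, exponential scales, and joint index maps are illustrative assumptions, not values from any cited paper.

```python
# Exponentiated tracking terms, as commonly used for mocap imitation,
# plus a mirroring helper for symmetry constraints/augmentation.
import numpy as np

def imitation_reward(q, qd, q_ref, qd_ref, w_pose=0.65, w_vel=0.15,
                     w_task=0.2, task_reward=0.0):
    """Reward = weighted pose tracking + velocity tracking + task term."""
    pose_term = np.exp(-2.0 * np.sum((q - q_ref) ** 2))   # joint positions
    vel_term = np.exp(-0.1 * np.sum((qd - qd_ref) ** 2))  # joint velocities
    return w_pose * pose_term + w_vel * vel_term + w_task * task_reward

def mirrored(q, mirror_index, mirror_sign):
    """Left-right mirroring of a joint vector (index map and signs assumed)."""
    return mirror_sign * q[mirror_index]

# Usage: a 4-joint toy example where joints (0,1) and (2,3) are left/right pairs.
q = np.array([0.1, -0.2, 0.05, 0.3])
print(imitation_reward(q, np.zeros(4), q_ref=np.zeros(4), qd_ref=np.zeros(4)))
print(mirrored(q, mirror_index=np.array([2, 3, 0, 1]),
               mirror_sign=np.array([1, -1, 1, -1])))
```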
Perception Integration, Terrain Adaptation, and Autonomous Navigation
These papers address robot performance in unstructured environments (stairs, slopes, obstacles, soft soil), covering the integration of visual perception, terrain-aware representation learning, world models, and the coupling of high-level path planning with low-level gait control. A minimal perception-input sketch follows the reference list below.
- No More Blind Spots: Learning Vision-Based Omnidirectional Bipedal Locomotion for Challenging Terrain(M. S. Gadde, Pranay Dugar, Ashish Malik, Alan Fern, 2025, 2025 IEEE-RAS 24th International Conference on Humanoid Robots (Humanoids))
- Learning Terrain Aware Bipedal Locomotion via Reduced Dimensional Perceptual Representations(Guillermo A. Castillo, Himanshu Lodha, Ayonga Hereid, 2025, ArXiv)
- Blind Bipedal Stair Traversal via Sim-to-Real Reinforcement Learning(Jonah Siekmann, Kevin R. Green, John Warila, Alan Fern, J. Hurst, 2021, ArXiv)
- Learning Vision-Based Bipedal Locomotion for Challenging Terrain(Helei Duan, Bikram Pandit, M. S. Gadde, Bart Jaap van Marum, Jeremy Dao, Chanho Kim, Alan Fern, 2023, 2024 IEEE International Conference on Robotics and Automation (ICRA))
- Terrain-Adaptive Bipedal Locomotion via Reinforcement Learning with Human-Inspired Stepping Strategy(Yunpeng Liang, Yan-zheng Zhao, Weixin Yan, 2025, No journal)
- VMTS: Vision-Assisted Teacher-Student Reinforcement Learning for Multi-Terrain Locomotion in Bipedal Robots(Fu Chen, Rui Wan, Peidong Liu, Nanxing Zheng, Bo Zhou, 2025, 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- LSWM: A Long–Short History World Model for Bipedal Locomotion via Reinforcement Learning(Jie Xue, ZhiYuan Liang, Haiming Mou, Qingdu Li, Jianwei Zhang, 2026, Biomimetics)
- EmoBipedNav: Emotion-aware Social Navigation for Bipedal Robots with Deep Reinforcement Learning(Wei Zhu, Abirath Raju, Abdulaziz Shamsah, Anqi Wu, Seth Hutchinson, Ye Zhao, 2025, ArXiv)
- Learning Bipedal Walking On Planned Footsteps For Humanoid Robots(Rohan P. Singh, M. Benallegue, M. Morisawa, Rafael Cisneros, F. Kanehiro, 2022, 2022 IEEE-RAS 21st International Conference on Humanoid Robots (Humanoids))
- Learning 3D Bipedal Walking with Planned Footsteps and Fourier Series Periodic Gait Planning(Song Wang, Songhao Piao, X. Leng, Zhicheng He, 2023, Sensors (Basel, Switzerland))
- Soft Soil Gait Planning and Control for Biped Robot using Deep Deterministic Policy Gradient Approach(G. Bhardwaj, Soham Dasgupta, N. Sukavanam, R. Balasubramanian, 2023, ArXiv)
- Learning Point-to-Point Bipedal Walking Without Global Navigation(Xueyan Ma, Boxing Wang, C. Tian, B. Xing, Yufei Liu, 2025, 2025 IEEE-RAS 24th International Conference on Humanoid Robots (Humanoids))
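A minimal sketch of the perception-augmented input pattern used by several of the papers above: a small convolutional encoder compresses a local terrain heightmap into a latent vector that is concatenated with proprioception before entering the policy. All layer sizes and dimensions are illustrative assumptions.

```python
# Heightmap -> latent vector, concatenated with proprioceptive state.
import torch
import torch.nn as nn

class TerrainEncoder(nn.Module):
    def __init__(self, latent_dim=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # For a 32x32 heightmap the conv stack yields 16 * 7 * 7 features.
        self.fc = nn.Linear(16 * 7 * 7, latent_dim)

    def forward(self, heightmap):          # heightmap: (B, 1, 32, 32)
        return self.fc(self.conv(heightmap))

encoder = TerrainEncoder()
heightmap = torch.zeros(1, 1, 32, 32)      # local terrain patch around the feet
proprio = torch.zeros(1, 40)               # joint angles, velocities, IMU, etc.
policy_input = torch.cat([encoder(heightmap), proprio], dim=-1)
print(policy_input.shape)                  # torch.Size([1, 56])
```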
Hierarchical Control Architectures and Curriculum Learning
These works adopt hierarchical reinforcement learning (HRL) or curriculum learning, decomposing the task into high-level policies and low-level executors to address poor sample efficiency and high task complexity, and enabling smooth transitions between different gaits. A minimal curriculum-scheduling sketch follows the reference list below.
- Hierarchical deep reinforcement learning to drag heavy objects by adult-sized humanoid robot(Saeed Saeedvand, Hanjaya Mandala, J. Baltes, 2021, Appl. Soft Comput.)
- Reinforcement Learning based Hierarchical Control for Path Tracking of a Wheeled Bipedal Robot with Sim-to-Real Framework(Wei Zhu, Fahad Raza, M. Hayashibe, 2022, 2022 IEEE/SICE International Symposium on System Integration (SII))
- Hybrid Bipedal Locomotion Based on Reinforcement Learning and Heuristics(Zhicheng Wang, Wandi Wei, Anhuan Xie, Yifeng Zhang, Jun Wu, Qiuguo Zhu, 2022, Micromachines)
- Reinforcement Learning-Based Cascade Motion Policy Design for Robust 3D Bipedal Locomotion(Guillermo A. Castillo, Bowen Weng, Wei Zhang, Ayonga Hereid, 2022, IEEE Access)
- Hierarchical Reinforcement Learning of Locomotion Policies in Response to Approaching Objects: A Preliminary Study(Shangqun Yu, Sreehari Rammohan, Kaiyu Zheng, G. Konidaris, 2022, ArXiv)
- Hierarchical Curriculum Learning with Optimized Experience Replay for Sample-Efficient Humanoid Locomotion(Chao Guan, Qiubo Zhong, Fei Chen, 2025, 2025 IEEE International Conference on Mechatronics and Automation (ICMA))
- Learning Setup Policies: Reliable Transition Between Locomotion Behaviours(Brendan Tidd, N. Hudson, Akansel Cosgun, J. Leitner, 2021, IEEE Robotics and Automation Letters)
- DeepWalk: Omnidirectional Bipedal Gait by Deep Reinforcement Learning(Diego Rodriguez, Sven Behnke, 2021, 2021 IEEE International Conference on Robotics and Automation (ICRA))
- Hierarchical World Models as Visual Whole-Body Humanoid Controllers(Nicklas Hansen, V JyothirS, Vlad Sobal, Yann LeCun, Xiaolong Wang, Hao Su, 2024, ArXiv)
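A minimal sketch of a velocity-target curriculum in the spirit of the curriculum-learning papers above: the commanded-velocity range widens once the recent success rate crosses a threshold. The levels, threshold, and window size are illustrative assumptions.

```python
# Success-gated widening of the commanded-velocity range.
import random

class VelocityCurriculum:
    def __init__(self, levels=((0.0, 0.2), (0.0, 0.5), (-0.5, 1.0)),
                 promote_at=0.8, window=100):
        self.levels = levels          # (min, max) target velocity per level
        self.level = 0
        self.promote_at = promote_at  # success rate required to advance
        self.window = window
        self.outcomes = []

    def sample_target_velocity(self):
        lo, hi = self.levels[self.level]
        return random.uniform(lo, hi)

    def report(self, episode_succeeded: bool):
        self.outcomes.append(episode_succeeded)
        recent = self.outcomes[-self.window:]
        if (len(recent) == self.window
                and sum(recent) / self.window >= self.promote_at
                and self.level < len(self.levels) - 1):
            self.level += 1           # advance to a harder velocity range
            self.outcomes.clear()

curriculum = VelocityCurriculum()
v_cmd = curriculum.sample_target_velocity()  # condition the policy on v_cmd
```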
Wheeled Bipedal Robot Control and Novel System Platforms
These papers focus in particular on smooth control of wheel-legged hybrid morphologies and adaptive switching between wheeled and legged modes, as well as the design of specific hardware platforms (e.g., Berkeley Humanoid, Cassie, ZBOT) and application practice in specific scenarios (e.g., RoboCup). A minimal action-smoothing sketch follows the reference list below.
- Adaptive multi-mode locomotion for bipedal wheel-legged robots via sparse mixture-of-experts deep reinforcement learning(Pan He, Zeang Zhao, Shengyu Duan, Panding Wang, Hongshuai Lei, 2026, Frontiers in Robotics and AI)
- Walking, Rolling, and Beyond: First-Principles and RL Locomotion on a TARS-Inspired Robot(Aditya Sripada, Abhishek Warrier, 2025, 2025 IEEE-RAS 24th International Conference on Humanoid Robots (Humanoids))
- AnyBipe: An Automated End-to-End Framework for Training and Deploying Bipedal Robots Powered by Large Language Models(Yifei Yao, Wentao He, Chenyu Gu, Jiaheng Du, Fuwei Tan, Zhen Zhu, Jun-Guo Lu, 2025, 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- ZBOT: A Novel Modular Robot Capable of Active Transformation from Snake to Bipedal Configuration through RL(Nanlin Zhou, Sikai Zhao, Hang Luo, Kai Han, Zhiyuan Yang, Jian Qi, Ning Zhao, Jie Zhao, Yanhe Zhu, 2025, 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- Berkeley Humanoid: A Research Platform for Learning-Based Control(Qiayuan Liao, Bike Zhang, Xuanyu Huang, Xiaoyu Huang, Zhongyu Li, K. Sreenath, 2024, 2025 IEEE International Conference on Robotics and Automation (ICRA))
- Lightweight Humanoid Locomotion with Minimal Sensing: IRSE X1's Design and Reinforcement Learning Control(Yuanpeng Wang, Shilong Sun, Haodong Huang, Chiyao Li, Yihui Li, 2025, 2025 IEEE International Conference on Robotics and Biomimetics (ROBIO))
- Deep Reinforcement Learning for Low-Cost Humanoid Robot Soccer Players: Dynamic Skills and Efficient Transfer(Animoni Nagaraju, M. G. V. Kumar, Y. Devi, A. Basi Reddy, M. Kumar, Ajmeera Kiran, 2023, 2023 Seventh International Conference on Image Information Processing (ICIIP))
- Feedback Control For Cassie With Deep Reinforcement Learning(Zhaoming Xie, G. Berseth, Patrick Clary, J. Hurst, M. V. D. Panne, 2018, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- Self-Distilled Action Policy Smoothing Conditioning for Continuous Control(Geunyoung Bae, Minjae Kim, Soohee Han, 2025, 2025 25th International Conference on Control, Automation and Systems (ICCAS))
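A minimal sketch of action smoothing, relevant to keeping wheel and joint motors free of chatter on wheel-legged platforms. It uses a plain exponential moving average, not the self-distilled method of the last entry above; the smoothing factor is an illustrative assumption.

```python
# Low-pass (EMA) filtering of policy actions via a gymnasium ActionWrapper.
import numpy as np
import gymnasium as gym

class ActionSmoothingWrapper(gym.ActionWrapper):
    def __init__(self, env, alpha=0.7):
        super().__init__(env)
        self.alpha = alpha            # alpha = 1.0 disables smoothing
        self.prev = np.zeros(env.action_space.shape)

    def reset(self, **kwargs):
        self.prev = np.zeros(self.env.action_space.shape)
        return self.env.reset(**kwargs)

    def action(self, action):
        # Blend the new action with the previously applied one.
        self.prev = self.alpha * np.asarray(action) + (1 - self.alpha) * self.prev
        return self.prev

env = ActionSmoothingWrapper(gym.make("BipedalWalker-v3"), alpha=0.6)
```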
This report synthesizes key research on deep-reinforcement-learning-based control of wheeled bipedal robots. The research thread starts from foundational RL algorithm optimization (e.g., improvements to PPO and SAC) and proceeds to fusing classical dynamics models (MPC/WBC) to improve the physical consistency of control. For real-world deployment, sim-to-real transfer and disturbance-resilient balance recovery constitute the core barrier. In parallel, bio-inspired and imitation learning are driving robot gaits toward naturalness and diversity. With the introduction of visual perception and hierarchical planning architectures, wheeled bipedal robots show great potential for autonomous navigation and task execution over complex terrain. Finally, dedicated optimization for the wheel-legged hybrid morphology marks the field's progression toward more efficient and more agile embodied intelligence.
A total of 147 related papers were surveyed.
Selected abstracts from the surveyed papers follow.
We investigated whether deep reinforcement learning (deep RL) is able to synthesize sophisticated and safe movement skills for a low-cost, miniature humanoid robot that can be composed into complex behavioral strategies. We used deep RL to train a humanoid robot to play a simplified one-versus-one soccer game. The resulting agent exhibits robust and dynamic movement skills, such as rapid fall recovery, walking, turning, and kicking, and it transitions between them in a smooth and efficient manner. It also learned to anticipate ball movements and block opponent shots. The agent’s tactical behavior adapts to specific game contexts in a way that would be impractical to manually design. Our agent was trained in simulation and transferred to real robots zero-shot. A combination of sufficiently high-frequency control, targeted dynamics randomization, and perturbations during training enabled good-quality transfer. In experiments, the agent walked 181% faster, turned 302% faster, took 63% less time to get up, and kicked a ball 34% faster than a scripted baseline.
Due to the nonlinearity and underactuation of bipedal robots, developing efficient jumping strategies remains challenging. To address this, a multiobjective collaborative deep reinforcement learning algorithm based on the actor‐critic framework is presented. Initially, two deep deterministic policy gradient (DDPG) networks are established for training the jumping motion, each focusing on different objectives and collaboratively learning the optimal jumping policy. Following this, a recovery experience replay mechanism, predicated on dynamic time warping, is integrated into the DDPG to enhance sample utilization efficiency. Concurrently, a timely adjustment unit is incorporated, which works in tandem with the training frequency to improve the convergence accuracy of the algorithm. Additionally, a Markov decision process is designed to manage the complexity and parameter uncertainty in the dynamic model of the bipedal robot. Finally, the proposed method is validated on a PyBullet platform. The results show that the method outperforms baseline methods by improving learning speed and enabling robust jumps with greater height and distance.
In this letter, we review the question of which action space is best suited for controlling a real biped robot in combination with Sim2Real training. Position control has been popular as it has been shown to be more sample efficient and intuitive to combine with other planning algorithms. However, for position control, gain tuning is required to achieve the best possible policy performance. We show that, instead, using a torque-based action space enables task-and-robot agnostic learning with less parameter tuning and mitigates the sim-to-reality gap by taking advantage of torque control's inherent compliance. Also, we accelerate the torque-based-policy training process by pre-training the policy to remain upright by compensating for gravity. The letter showcases the first successful sim-to-real transfer of a torque-based deep reinforcement learning policy on a real human-sized biped robot.
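A minimal sketch contrasting the two action spaces discussed in this abstract: the same normalized policy output is interpreted either as a joint-position target tracked by a PD loop or directly as a torque command. The gains, limits, and dimensions are illustrative assumptions, not the paper's values.

```python
# Two interpretations of a policy action: position target vs. direct torque.
import numpy as np

def position_action_to_torque(a, q, qd, kp=80.0, kd=2.0, torque_limit=40.0):
    """Position control: `a` is a target joint angle; a PD loop yields torque."""
    torque = kp * (a - q) - kd * qd
    return np.clip(torque, -torque_limit, torque_limit)

def torque_action(a, torque_limit=40.0):
    """Torque control: `a` in [-1, 1] is scaled directly to actuator limits."""
    return np.clip(a, -1.0, 1.0) * torque_limit

# Usage on a toy 10-joint robot state:
q, qd = np.zeros(10), np.zeros(10)
a = 0.1 * np.ones(10)
print(position_action_to_torque(a, q, qd)[:3])  # PD-tracked position targets
print(torque_action(a)[:3])                     # direct torque commands
```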
The bipedal walking robot is an advanced anthropomorphic robot that can mimic the human ability to walk. Controlling it is difficult due to its nonlinearity and complexity. To address this, recent studies have applied various machine learning algorithms based on reinforcement learning; however, most of them rely on deterministic-policy strategies. This research proposes Soft Actor-Critic (SAC), which uses a stochastic policy, for controlling the bipedal walking robot. The choice between deterministic and stochastic policies affects the exploration behavior of a DRL algorithm. SAC is a Deep Reinforcement Learning (DRL) algorithm whose entropy-augmented expected return lets it learn faster and explore more while still ensuring convergence. The SAC algorithm’s performance is validated by having a bipedal robot walk along a straight-line trajectory, using the per-episode reward and the cumulative reward during training as evaluation metrics. The SAC algorithm controls the bipedal walking robot well, with a total reward of 384,752.8.
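For reference, the entropy-augmented expected return that this abstract alludes to is conventionally written as follows (the standard SAC formulation; the notation is not taken from the paper itself), where alpha is the temperature weighting the entropy bonus that drives exploration:

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\!\left(\pi(\cdot \mid s_t)\right) \right]
```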
Recovering after an abrupt push is essential for bipedal robots in real-world applications within environments where humans must collaborate closely with robots. There are several balancing algorithms for bipedal robots in the literature; however, most of them rely either on hard coding or on power-hungry algorithms. We propose a hybrid autonomous controller that hierarchically combines two separate, efficient systems to address this problem. The lower-level system is a reliable, high-speed, full-state controller hardcoded on a microcontroller to be power efficient. The higher-level system is a low-speed reinforcement learning controller implemented on a low-power onboard computer. While one controller offers speed, the other provides trainability and adaptability. An efficient control is thus formed without sacrificing adaptability to new dynamic environments. Additionally, as the higher-level system is trained via deep reinforcement learning, the robot can continue learning after deployment, which is ideal for real-world applications. The system’s performance is validated with a real robot recovering from a random push in less than 5 s, with minimal steps from its initial position. The training was conducted using simulated data.
This chapter addresses the critical challenge of simulation-to-reality (sim-to-real) transfer for deep reinforcement learning (DRL) in bipedal locomotion. After contextualizing the problem within various control architectures, we dissect the "curse of simulation" by analyzing the primary sources of sim-to-real gap: robot dynamics, contact modeling, state estimation, and numerical solvers. Building on this diagnosis, we structure the solutions around two complementary philosophies. The first is to shrink the gap through model-centric strategies that systematically improve the simulator's physical fidelity. The second is to harden the policy, a complementary approach that uses in-simulation robustness training and post-deployment adaptation to make the policy inherently resilient to model inaccuracies. The chapter concludes by synthesizing these philosophies into a strategic framework, providing a clear roadmap for developing and evaluating robust sim-to-real solutions.
We propose a multipath gait control strategy based on deep reinforcement learning (DRL) for bipedal robot motion planning on diverse and challenging terrains. Traditional control methods, such as PID controllers and model-based motion planning, often struggle in complex environments. These approaches typically underperform because they rely on precise mathematical models or predefined rules, making them ill-suited for nonlinear, uncertain, and dynamic settings. Conventional techniques also have difficulty adapting their control strategies in unpredictable and fluctuating terrains, where robots may encounter unforeseen disturbances, leading to instability or failure. Because it combines deep learning and reinforcement learning, DRL can independently acquire optimal control strategies from environmental feedback without requiring a precise model. In this work, we leverage deep reinforcement learning algorithms (DDPG, TRPO, PPO, A3C, SAC, etc.) based on actor-critic (AC) architectures to enable reliable gait control of bipedal robots in a continuous motion environment. The difficulty that traditional approaches have converging in complicated environments is addressed by DRL, which can effectively cope with the high nonlinearity of complex terrain and adaptively adjust its strategy through continuous interaction with the environment. Using goal-conditional techniques, we created a motion planning model and tested it on the actual hardware platform Cassie. According to the experimental results, the approach successfully transfers the simulation strategy to the real environment, and the robot can accurately complete the goal task without global location feedback. It can also perform a variety of complex tasks, like jumping on discontinuous and flat terrain. Furthermore, the method exhibits significant robustness and adaptability through multithreaded asynchronous training and randomized strategy selection, which addresses the shortcomings of conventional motion planning methods in hyperparameter tuning and strategy convergence.
In this research, an optimization methodology was introduced for improving bipedal robot locomotion controlled by reinforcement learning (RL) algorithms. Specifically, the study focused on optimizing the Proximal Policy Optimization (PPO), Advantage Actor-Critic (A2C), Soft Actor-Critic (SAC), and Twin Delayed Deep Deterministic Policy Gradients (TD3) algorithms. The optimization process utilized the Tree-structured Parzen Estimator (TPE), a Bayesian optimization technique. All RL algorithms were applied to the same environment, which was created within the OpenAI GYM framework and known as the bipedal walker. The optimization involved the fine-tuning of key hyperparameters, including learning rate, discount factor, generalized advantage estimation, entropy coefficient, and Polyak update parameters. The study comprehensively analyzed the impact of these hyperparameters on the performance of RL algorithms. The results of the optimization efforts were promising, as the fine-tuned RL algorithms demonstrated significant improvements in performance. The mean reward values for the 10 trials were as follows: PPO achieved an average reward of 181.3, A2C obtained an average reward of −122.2, SAC reached an average reward of 320.3, and TD3 had an average reward of 278.6. These outcomes underscore the effectiveness of the optimization approach in enhancing the locomotion capabilities of the bipedal robot using RL techniques.
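A minimal sketch of the TPE-based tuning procedure described above, assuming Optuna's TPESampler together with stable-baselines3 PPO on BipedalWalker-v3; the search ranges and training budgets are illustrative assumptions, not the study's settings.

```python
# TPE (Bayesian) hyperparameter search over a few PPO knobs with Optuna.
import gymnasium as gym
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    gamma = trial.suggest_float("gamma", 0.95, 0.999)
    gae_lambda = trial.suggest_float("gae_lambda", 0.9, 0.99)
    ent_coef = trial.suggest_float("ent_coef", 1e-8, 1e-2, log=True)

    env = gym.make("BipedalWalker-v3")
    model = PPO("MlpPolicy", env, learning_rate=lr, gamma=gamma,
                gae_lambda=gae_lambda, ent_coef=ent_coef, verbose=0)
    model.learn(total_timesteps=50_000)   # small budget per trial
    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=5)
    env.close()
    return mean_reward

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=10)
print(study.best_params)
```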
The bipedal wheel-legged robot combines the high energy efficiency of wheeled movement with the terrain adaptability of legged locomotion. However, achieving a smooth transition between these two heterogeneous motion modes within a unified control framework remains challenging. This study proposes a reinforcement learning control framework that integrates the Mixture of Experts (MoE) architecture. This approach employs a “divide and conquer” strategy by introducing a dynamic gating network and a Top-K sparse activation mechanism, which automatically allocates different motion modes to specific expert subnetworks, effectively decoupling conflicting gradients. Simulation results demonstrate that, compared to the single-network PPO method, the MoE-enhanced algorithm exhibits significant improvements in training stability and rewards. The learned policy successfully achieved smooth rolling on flat surfaces and transitioned to dynamic leg-lifting gaits when confronted with obstacles. In various test terrains, it showed a markedly higher success rate compared to the single-network PPO method.
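A minimal sketch of the Top-K sparse mixture-of-experts routing described above, written in PyTorch; the number of experts, K, and layer sizes are illustrative assumptions rather than the paper's configuration.

```python
# Top-K sparse MoE: a gate picks K experts per input and blends their outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, obs_dim=48, act_dim=10, n_experts=4, k=2, hidden=128):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(obs_dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, act_dim))
            for _ in range(n_experts)
        )

    def forward(self, obs):                         # obs: (B, obs_dim)
        logits = self.gate(obs)                     # (B, n_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)      # renormalize over top-K only
        out = torch.stack([e(obs) for e in self.experts], dim=1)  # (B, E, act)
        picked = out.gather(
            1, topk_idx.unsqueeze(-1).expand(-1, -1, out.size(-1)))
        return (weights.unsqueeze(-1) * picked).sum(dim=1)        # (B, act_dim)

policy = TopKMoE()
print(policy(torch.zeros(2, 48)).shape)             # torch.Size([2, 10])
```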
Considering the dynamics and nonlinear characteristics of biped robots, gait optimization is an extremely challenging task. To tackle this issue, a parallel heterogeneous-policy Deep Reinforcement Learning (DRL) algorithm for gait optimization is proposed. Firstly, the Deep Deterministic Policy Gradient (DDPG) algorithm is used as the main architecture to run multiple biped robots in parallel interacting with the environment, and the network is shared to improve training efficiency. Furthermore, heterogeneous experience replay is employed instead of the traditional experience replay mechanism to optimize the utilization of experience. Secondly, according to the walking characteristics of biped robots, a periodic gait is designed with reference to sinusoidal curves; it accounts for foot lift height, walking period, foot lift speed, and ground contact force. Finally, since different environments and different biped robot models pose challenges for different optimization algorithms, a unified gait optimization framework based on the RoboCup3D platform is established. Comparative experiments conducted within this framework show that the proposed method makes the biped robot walk faster and more stably.
Trajectory planning and control of bipedal walking robots require precise joint torque computation to ensure stability and efficiency. Given the nonlinear dynamics and complex interactions of bipedal systems, achieving stable walking remains a major challenge. Deep reinforcement learning (DRL) offers a promising solution by directly mapping observed states to optimal actions that maximize cumulative rewards. In this work, we integrate deep learning-based trajectory planning with a DRL-driven control system to generate optimal joint torque sequences. Our approach aims to achieve stable walking with maximum forward speed, minimal power consumption, and enhanced stability to prevent falls. After training, the bipedal robot demonstrates stable and resilient locomotion, maintaining balance throughout the gait cycle. Additionally, it exhibits robust performance under uncertainties, handling mass variations of up to 20% and length variations of up to 5%. The robot effectively rejects disturbances at different angular velocities across various gait phases, enhancing adaptability. This approach improves the robustness and efficiency of bipedal robots, making them more suitable for real-world applications requiring reliable and adaptive locomotion.
Bipedal walking is one of the most difficult but exciting challenges in robotics. The difficulties arise from the complexity of high-dimensional dynamics, sensing and actuation limitations combined with real-time and computational constraints. Deep Reinforcement Learning (DRL) holds the promise to address these issues by fully exploiting the robot dynamics with minimal craftsmanship. In this paper, we propose a novel DRL approach that enables an agent to learn omnidirectional locomotion for humanoid (bipedal) robots. Notably, the locomotion behaviors are accomplished by a single control policy (a single neural network). We achieve this by introducing a new curriculum learning method that gradually increases the task difficulty by scheduling target velocities. In addition, our method does not require reference motions, which facilitates its application to robots with different kinematics and reduces the overall complexity. Finally, different strategies for sim-to-real transfer are presented which allow us to transfer the learned policy to a real humanoid robot.
This paper presents a comprehensive study on using deep reinforcement learning (RL) to create dynamic locomotion controllers for bipedal robots. Going beyond focusing on a single locomotion skill, we develop a general control solution that can be used for a range of dynamic bipedal skills, from periodic walking and running to aperiodic jumping and standing. Our RL-based controller incorporates a novel dual-history architecture, utilizing both a long-term and short-term input/output (I/O) history of the robot. This control architecture, when trained through the proposed end-to-end RL approach, consistently outperforms other methods across a diverse range of skills in both simulation and the real world. The study also delves into the adaptivity and robustness introduced by the proposed RL system in developing locomotion controllers. We demonstrate that the proposed architecture can adapt to both time-invariant dynamics shifts and time-variant changes, such as contact events, by effectively using the robot’s I/O history. Additionally, we identify task randomization as another key source of robustness, fostering better task generalization and compliance to disturbances. The resulting control policies can be successfully deployed on Cassie, a torque-controlled human-sized bipedal robot. This work pushes the limits of agility for bipedal robots through extensive real-world experiments. We demonstrate a diverse range of locomotion skills, including: robust standing, versatile walking, fast running with a demonstration of a 400-meter dash, and a diverse set of jumping skills, such as standing long jumps and high jumps.
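A minimal sketch of the dual-history input described above: a short, per-step I/O history is concatenated with a longer, subsampled one. The window lengths, stride, and dimensions are illustrative assumptions.

```python
# Short fine-grained history + long subsampled history as the policy input.
import collections
import numpy as np

class DualHistory:
    def __init__(self, io_dim, short_len=4, long_len=64, long_stride=8):
        self.short = collections.deque(maxlen=short_len)
        self.long = collections.deque(maxlen=long_len)
        self.long_stride = long_stride
        for _ in range(short_len):
            self.short.append(np.zeros(io_dim))
        for _ in range(long_len):
            self.long.append(np.zeros(io_dim))

    def push(self, io_vector):
        # io_vector: current observation concatenated with the previous action.
        self.short.append(io_vector)
        self.long.append(io_vector)

    def feature(self):
        recent = np.concatenate(self.short)                        # every step
        coarse = np.concatenate(list(self.long)[:: self.long_stride])  # sparse
        return np.concatenate([recent, coarse])

hist = DualHistory(io_dim=52)
hist.push(np.zeros(52))
print(hist.feature().shape)   # (4 + 64 / 8) * 52 = (624,)
```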
Balance control remains a fundamental challenge in humanoid robotics due to the underactuated, high-dimensional, and dynamically unstable nature of bipedal systems. Conventional control methods, typically relying on simplified dynamic models, offer limited adaptability and require extensive manual tuning, particularly under unpredictable external disturbances. In this work, we propose a data-driven framework for push recovery of the humanoid robot Talos using deep reinforcement learning (DRL). A key contribution of our approach lies in the reward function, which incorporates principles from capture point and divergent component of motion (DCM) theory to encourage stable and human-like balance strategies. By training under a broad distribution of perturbations, the learned policy autonomously discovers a spectrum of recovery behaviors, including ankle, hip, and stepping responses, without access to explicit model dynamics. We further evaluate the policy’s robustness under previously unseen perturbations to assess generalization. Results demonstrate that our method enables fast convergence, diverse strategy deployment, and strong resilience to unexpected disturbances, highlighting the efficacy of physically informed reward shaping in DRL-based humanoid control.
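A minimal sketch of a DCM-informed reward term in the spirit of this abstract: the divergent component of motion, xi = x + xd / omega, should stay near the support region. The penalty scale and toy geometry are illustrative assumptions, not the paper's reward.

```python
# Reward shaping with the divergent component of motion (DCM / capture point).
import math

G = 9.81  # gravitational acceleration (m/s^2)

def dcm(com_pos, com_vel, com_height):
    """Divergent component of motion (2D): xi = x + xd / omega."""
    omega = math.sqrt(G / com_height)
    return tuple(p + v / omega for p, v in zip(com_pos, com_vel))

def dcm_reward(com_pos, com_vel, com_height, support_center, scale=20.0):
    """Exponential penalty on the DCM's distance from the support center."""
    xi = dcm(com_pos, com_vel, com_height)
    dist2 = sum((a - b) ** 2 for a, b in zip(xi, support_center))
    return math.exp(-scale * dist2)

# Usage: CoM slightly ahead of the support center and moving forward.
print(dcm_reward(com_pos=(0.02, 0.0), com_vel=(0.1, 0.0),
                 com_height=0.9, support_center=(0.0, 0.0)))
```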
Reinforcement learning (RL) has emerged as a powerful method to learn robust control policies for bipedal locomotion. Yet, it can be difficult to tune desired robot behaviors due to unintuitive and complex reward design. In comparison, offline trajectory optimization methods, like Hybrid Zero Dynamics, offer more tuneable, interpretable, and mathematically grounded motion plans for high-dimensional legged systems. However, these methods often remain brittle to real-world disturbances like external perturbations. In this work, we present NaviGait, a hierarchical framework that combines the structure of trajectory optimization with the adaptability of RL for robust and intuitive locomotion control. NaviGait leverages a library of offline-optimized gaits and smoothly interpolates between them to produce continuous reference motions in response to high-level commands. The policy provides both joint-level and velocity command residual corrections to modulate and stabilize the reference trajectories in the gait library. One notable advantage of NaviGait is that it dramatically simplifies reward design by encoding rich motion priors from trajectory optimization, reducing the need for finely tuned shaping terms and enabling more stable and interpretable learning. Our experimental results demonstrate that NaviGait enables faster training compared to conventional and imitation-based RL, and produces motions that remain closest to the original reference. Overall, by decoupling high-level motion generation from low-level correction, NaviGait offers a more scalable and generalizable approach for achieving dynamic and robust locomotion.
Only recently has robust robot locomotion been achieved by deep reinforcement learning (DRL). However, for efficient learning of parametrized bipedal walking, carefully developed references are usually required, limiting performance to that of the references. In this paper, we propose an adaptive reward function for imitation learning from references. The agent is encouraged to mimic the references when its performance is low, and to pursue high performance once it reaches the limit of the references. We further demonstrate that developed references can be replaced by low-quality references that are generated without laborious tuning and are infeasible to deploy by themselves, as long as they provide a priori knowledge to expedite the learning process.
For the deployment of legged robots in real-world environments, it is essential to develop robust locomotion control methods for challenging terrains that may exhibit unexpected deformability and irregularity. In this paper, we explore the application of sim-to-real deep reinforcement learning (RL) for the design of bipedal locomotion controllers for humanoid robots on compliant and uneven terrains. Our key contribution is to show that a simple training curriculum for exposing the RL agent to randomized terrains in simulation can achieve robust walking on a real humanoid robot using only proprioceptive feedback. We train an end-to-end bipedal locomotion policy using the proposed approach, and show extensive real-robot demonstration on the HRP-5P humanoid over several difficult terrains inside and outside the lab environment. Further, we argue that the robustness of a bipedal walking policy can be improved if the robot is allowed to exhibit aperiodic motion with variable stepping frequency. We propose a new control policy to enable modification of the observed clock signal, leading to adaptive gait frequencies depending on the terrain and command velocity. Through simulation experiments, we show the effectiveness of this policy specifically for walking over challenging terrains by controlling swing and stance durations. The code for training and evaluation is available online. https://github.com/rohanpsingh/LearningHumanoidWalking
This paper investigates the stabilization of passive bipedal robots using a Soft Actor-Critic (SAC) reinforcement learning approach. The primary objective is to enhance the robot's dynamic stability while navigating various slopes, particularly focusing on achieving a periodic solution. Our findings reveal that the robot can maintain stable walking up to a slope of 0.0252 radians. The SAC agent enables direct control of the ankle angle, eliminating the need for prior identification of unstable limit cycle conditions and sensitivity analysis for control gains. This research highlights the potential of reinforcement learning in improving robotic locomotion. It presents a framework for future investigations into adaptive control strategies that can effectively manage the complexities of bipedal passive motion.
When a gait of a bipedal robot is developed using deep reinforcement learning, reference trajectories may or may not be used. Each approach has its advantages and disadvantages, and the choice of method is up to the control developer. This paper investigates the effect of reference trajectories on locomotion learning and the resulting gaits. We implemented three gaits of a full-order anthropomorphic robot model with different reward imitation ratios, provided sim-to-sim control policy transfer, and compared the gaits in terms of robustness and energy efficiency. In addition, we conducted a qualitative analysis of the gaits by interviewing people, since our task was to create an appealing and natural gait for a humanoid robot. According to the results of the experiments, the most successful approach was the one in which the average value of rewards for imitation and adherence to command velocity per episode remained balanced throughout the training. The gait obtained with this method retains naturalness (median of 3.6 according to the user study) compared to the gait trained with imitation only (median of 4.0), while remaining robust close to the gait trained without reference trajectories.
Deep Reinforcement Learning (DRL) has brought major improvements to robotic control. However, training stable and efficient bipedal gaits like walking and running remains challenging: current DRL approaches can be sample-inefficient and struggle with stability, while traditional methods may lack peak performance and adaptability. This paper proposes a novel approach integrating Proximal Policy Optimization (PPO) with Skill-Set-Primitives (SS-primitives), which are predefined, biomechanically inspired motion patterns capturing commonalities across locomotion tasks. By providing a structured baseline, SS-primitives allow PPO to focus on refining performance and adapting to specific goals. We introduce a multi-objective reward function designed to prioritize key locomotion metrics. Experiments conducted in the SimSpark simulation environment demonstrate that our method significantly improves walking and running speed and stability compared to standard PPO training. The effectiveness of this approach was further validated by the Apollo3D team's success, achieving Champion at the RoboCup China Open 2024 and third place at RoboCup 2024 using this framework.
To improve the stability of bipedal walking of humanoid robots, we developed a method of setting trajectory parameters using reinforcement learning on a treadmill-like testbed in a real-world environment. A deep deterministic policy gradient (DDPG) was used as the reinforcement learning algorithm. By improving the reward using the zero moment point (ZMP), the optimal balance between walking stability and walking speed was determined. The robot was designed to measure the ZMP and to carry weights on its upper body. In addition, a treadmill was manufactured to operate at the same speed as the walking speed of the robot. Reinforcement learning was run both without a weight and with a 1 kg weight. Over approximately 100 minutes, 300 episodes were performed, yielding reward improvements of 16.71% and 26.25%, respectively. The ZMP measurements indicated that bipedal walking stayed within a safe region. We therefore demonstrated that the bipedal walking performance of a humanoid robot can be improved by reinforcement learning over walking speed and ZMP similarity.
Deep reinforcement learning has seen successful implementations on humanoid robots to achieve dynamic walking. However, these implementations have so far succeeded only in simple environments void of obstacles. In this paper, we aim to achieve bipedal locomotion in an environment where obstacles are present using policy-based reinforcement learning. By adding simple distance reward terms to a state-of-the-art reward function that can achieve basic bipedal locomotion, the trained policy succeeds in navigating the robot towards the desired destination without colliding with the obstacles along the way.
Recent advances in deep reinforcement learning (RL) based techniques combined with training in simulation have offered a new approach to developing robust controllers for legged robots. However, the application of such approaches to real hardware has largely been limited to quadrupedal robots with direct-drive actuators and light-weight bipedal robots with low gear-ratio transmission systems. Application to real, life-sized humanoid robots has been less common arguably due to a large sim2real gap. In this paper, we present an approach for effectively overcoming the sim2real gap issue for humanoid robots arising from inaccurate torque-tracking at the actuator level. Our key idea is to utilize the current feedback from the actuators on the real robot, after training the policy in a simulation environment artificially degraded with poor torque-tracking. Our approach successfully trains a unified, end-to-end policy in simulation that can be deployed on a real HRP-5P humanoid robot to achieve bipedal locomotion. Through ablations, we also show that a feedforward policy architecture combined with targeted dynamics randomization is sufficient for zero-shot sim2real success, thus eliminating the need for computationally expensive, memory-based network architectures. Finally, we validate the robustness of the proposed RL policy by comparing its performance against a conventional model-based controller for walking on uneven terrain with the real robot.
This paper proposes a model-free, memory-augmented Deep Reinforcement Learning (DRL) method that can deal with noisy sensors in humanoid locomotion. DRL-based agents are promising for automatically learning how to control robots in complex simulated environments, yet challenging noisy scenarios for humanoid robots are still not fully addressed by model-free control algorithms. This work shows how the Soft Actor-Critic (SAC) algorithm can benefit from the memory effect introduced by LSTMs to mitigate the side effects of Partially Observable Markov Decision Processes (POMDPs). We demonstrate that LSTM-SAC is a viable path towards DRL for POMDPs by applying it to a bipedal locomotion task with the NAO robot in various noisy scenarios.
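A minimal sketch of a recurrent actor in the spirit of LSTM-SAC: the LSTM's hidden state summarizes the observation history, mitigating partial observability from noisy sensors. The dimensions are illustrative assumptions, and the full SAC machinery (critics, temperature, replay buffer) is omitted.

```python
# Recurrent stochastic actor: hidden state carries memory across control steps.
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    def __init__(self, obs_dim=40, act_dim=12, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.mean = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)  # SAC's policy is stochastic

    def forward(self, obs_seq, hidden_state=None):
        # obs_seq: (B, T, obs_dim); hidden_state carries memory between calls.
        out, hidden_state = self.lstm(obs_seq, hidden_state)
        return self.mean(out), self.log_std(out), hidden_state

actor = LSTMActor()
mean, log_std, h = actor(torch.zeros(1, 1, 40))     # one control step
mean, log_std, h = actor(torch.zeros(1, 1, 40), h)  # memory persists
```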
Knee-less bipedal robots like SLIDER have the advantage of ultra-lightweight legs and improved walking energy efficiency compared to traditional humanoid robots. In this paper, we firstly introduce an improved hardware design of the SLIDER bipedal robot with new line-feet and more optimized mass distribution that enables higher locomotion speeds. Secondly, we propose an extended Hybrid Zero Dynamics (eHZD) method, which can be applied to prismatic joint robots like SLIDER. The eHZD method is then used to generate a library of gaits with varying reference velocities in an offline way. Thirdly, a Guided Deep Reinforcement Learning (DRL) algorithm is proposed to use the pre-generated library to create walking control policies in real-time. This approach allows us to combine the advantages of both HZD (for generating stable gaits with a full-dynamics model) and DRL (for real-time adaptive gait generation). The experimental results show that this approach achieves 150% higher walking velocity than the previous MPC-based approach.
Locomotion control has long been vital to legged robots. Agile locomotion can be implemented through either model-based controllers or reinforcement learning; it has been shown that robust controllers can be obtained through model-based methods, while learning-based policies have advantages in generalization. This paper proposes a hybrid locomotion-control framework that combines deep reinforcement learning with a simple heuristic policy and assigns them to different activation phases, providing guidance for adaptive training without producing conflicts between heuristic knowledge and learned policies. Training in simulation follows a step-by-step stochastic curriculum to guarantee success. Domain randomization during training and assistive extra feedback loops on the real robot are also adopted to smooth the transition to the real world. Comparison experiments on both simulated and real Wukong-IV humanoid robots show that the proposed hybrid approach matches canonical end-to-end approaches with a higher success rate, faster convergence, and 60% less tracking error in velocity tracking tasks.
Controlling a non-statically stable bipedal robot is challenging due to the complex dynamics and multi-criterion optimization involved. Recent works have demonstrated the effectiveness of deep reinforcement learning (DRL) for simulated and physical robots. In these methods, the rewards from different criteria are normally summed to learn a scalar value function. However, a scalar is less informative and may be insufficient to derive effective information for each reward channel from the complex hybrid rewards. In this work, we propose a novel reward-adaptive reinforcement learning method for biped locomotion, allowing the control policy to be simultaneously optimized by multiple criteria using a dynamic mechanism. The proposed method applies a multi-head critic to learn a separate value function for each reward component, leading to hybrid policy gradients. We further propose dynamic weights, allowing each component to optimize the policy with different priorities. This hybrid and dynamic policy gradient (HDPG) design makes the agent learn more efficiently. We show that the proposed method outperforms summed-up-reward approaches and is able to transfer to physical robots. The MuJoCo results further demonstrate the effectiveness and generalization of HDPG.
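A minimal sketch of the multi-head critic idea described above: one value head per reward channel, whose outputs can then be combined under dynamic priorities. The sizes and the example channels are illustrative assumptions, not the paper's architecture.

```python
# One value head per reward component, combined with dynamic weights.
import torch
import torch.nn as nn

class MultiHeadCritic(nn.Module):
    def __init__(self, obs_dim=40, n_reward_channels=3, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # One scalar value estimate per reward channel
        # (e.g., velocity tracking, energy, balance).
        self.heads = nn.ModuleList(nn.Linear(hidden, 1)
                                   for _ in range(n_reward_channels))

    def forward(self, obs):                           # obs: (B, obs_dim)
        z = self.trunk(obs)
        return torch.cat([head(z) for head in self.heads], dim=-1)  # (B, C)

critic = MultiHeadCritic()
values = critic(torch.zeros(4, 40))                   # per-channel values
weights = torch.tensor([1.0, 0.3, 0.5])               # dynamic priorities
weighted_value = (values * weights).sum(dim=-1)       # combined (illustrative)
print(weighted_value.shape)                           # torch.Size([4])
```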
Bipedal locomotion skills are challenging to develop. Control strategies often use local linearization of the dynamics in conjunction with reduced-order abstractions to yield tractable solutions. In these model-based control strategies, the controller is often not fully aware of many details, including torque limits, joint limits, and other non-linearities that are necessarily excluded from the control computations for simplicity. Deep reinforcement learning (DRL) offers a promising model-free approach for controlling bipedal locomotion which can more fully exploit the dynamics. However, current results in the machine learning literature are often based on ad-hoc simulation models that are not based on corresponding hardware. Thus it remains unclear how well DRL will succeed on realizable bipedal robots. In this paper, we demonstrate the effectiveness of DRL using a realistic model of Cassie, a bipedal robot. By formulating a feedback control problem as finding the optimal policy for a Markov Decision Process, we are able to learn robust walking controllers that imitate a reference motion with DRL. Controllers for different walking speeds are learned by imitating simple time-scaled versions of the original reference motion. Controller robustness is demonstrated through several challenging tests, including sensory delay, walking blindly on irregular terrain and unexpected pushes at the pelvis. We also show we can interpolate between individual policies and that robustness can be improved with an interpolated policy.
This article compares various implementations of deep Q-learning, one of the most efficient reinforcement learning algorithms for discrete action spaces. The implementations are first evaluated on the classical CartPole problem ported to the Gazebo environment and then compared on a custom bipedal robot problem. Since building and configuring a real robotic system is a laborious process, initial debugging of the robot can be performed in software that simulates the real environment; here, the Gazebo simulator was used, which allows research to be conducted without a physical robot and the results to be transferred to the real system later. The study concludes that deep Q-learning with an experience replay mechanism is the most efficient of the compared implementations. It also concludes that, even for a robot with two degrees of freedom, Q-learning algorithms are not effective enough, and a comparative study with other families of reinforcement learning algorithms is needed.
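As an illustration of the experience replay mechanism the study found most effective, here is a minimal uniform replay buffer; the capacity and batch size are arbitrary choices, and the `q_target` name in the trailing comment is a hypothetical target network, not code from the paper.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Uniform experience replay: store transitions, sample i.i.d.
    minibatches to decorrelate consecutive simulator steps."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2, d = map(np.array, zip(*batch))
        return s, a, r, s2, d

    def __len__(self):
        return len(self.buffer)

# Inside a DQN training loop (illustrative pseudo-usage):
#   buffer.push(s, a, r, s2, done)
#   if len(buffer) >= 1_000:
#       s, a, r, s2, d = buffer.sample()
#       target = r + 0.99 * (1 - d) * q_target(s2).max(axis=1)
#       ... fit q(s)[a] toward target ...
```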
Classical control techniques such as PID and LQR have been used effectively in maintaining a system state, but these techniques become more difficult to implement when the model dynamics increase in complexity and sensitivity. For adaptive robotic locomotion tasks with several degrees of freedom, classical control techniques become infeasible. Instead, reinforcement learning can train optimal walking policies with ease. We apply deep Q-learning and augmented random search (ARS) to teach a simulated two-dimensional bipedal robot how to walk using the OpenAI Gym BipedalWalker-v3 environment. Deep Q-learning did not yield a high-reward policy, often prematurely converging to suboptimal local maxima, likely due to the coarsely discretized action space. ARS, however, resulted in a better-trained robot and produced an optimal policy which officially "solves" the BipedalWalker-v3 problem. Various naive policies, including a random policy, a manually encoded inch-forward policy, and a stay-still policy, were used as benchmarks to evaluate the proficiency of the learning algorithm results.
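A compact sketch of one ARS update step, assuming a linear policy and a user-supplied `env_rollout` evaluator that returns an episode return; the hyperparameters are illustrative and the state normalization used by full ARS (V2) is omitted.

```python
import numpy as np

def ars_step(theta, env_rollout, n_dirs=16, top_k=8, step=0.02, noise=0.03):
    """One Augmented Random Search update on a linear policy theta.
    env_rollout(theta) must return the episode return for that policy."""
    deltas = [np.random.randn(*theta.shape) for _ in range(n_dirs)]
    rewards = [(env_rollout(theta + noise * d),
                env_rollout(theta - noise * d), d) for d in deltas]
    # Keep the directions with the largest max(+, -) return.
    rewards.sort(key=lambda t: max(t[0], t[1]), reverse=True)
    top = rewards[:top_k]
    sigma = np.std([r for rp, rm, _ in top for r in (rp, rm)]) + 1e-8
    grad = sum((rp - rm) * d for rp, rm, d in top) / (top_k * sigma)
    return theta + step * grad
```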
No abstract available
Deep reinforcement learning (RL) based controllers for legged robots have demonstrated impressive robustness for walking in different environments for several robot platforms. To enable the application of RL policies for humanoid robots in real-world settings, it is crucial to build a system that can achieve robust walking in any direction, on 2D and 3D terrains, and be controllable by a user command. In this paper, we tackle this problem by learning a policy to follow a given step sequence. The policy is trained with the help of a set of procedurally generated step sequences (also called footstep plans). We show that simply feeding the upcoming 2 steps to the policy is sufficient to achieve omnidirectional walking, turning in place, standing, and climbing stairs. Our method employs curriculum learning on the complexity of terrains, and circumvents the need for reference motions or pre-trained weights. We demonstrate the application of our proposed method to learn RL policies for 2 new robot platforms - HRP5P and JVRC-1 - in the MuJoCo simulation environment. The code for training and evaluation is available online at https://github.com/rohanpsingh/LearningHumanoidWalking.
A humanoid robot, being structurally similar to a human, is inherently unstable, so push recovery control is essential. The proposed push recovery controller consists of an IMU sensor part, a high-level push recovery controller, and a low-level push recovery controller. The IMU sensor part measures the linear velocity and angular velocity and transmits them to the high-level push recovery controller. The high-level push recovery controller selects the strategy of the low-level push recovery controller based on the stability region, which is improved using the DQN (Deep Q-Network) reinforcement learning method. The low-level push recovery controller consists of ankle, hip, and step strategies. Each strategy is analyzed using the LIPM (Linear Inverted Pendulum Model), and based on the analysis, the actuators corresponding to each strategy are controlled.
Machine learning algorithms have found several applications in the field of robotics and control systems. The control systems community has started to show interest in several machine learning algorithms from sub-domains such as supervised learning, imitation learning, and reinforcement learning to achieve autonomous control and intelligent decision making. Among many complex control problems, stable bipedal walking has been the most challenging. In this paper, we present an architecture to design and simulate a planar bipedal walking robot (BWR) using a realistic robotics simulator, Gazebo. The robot demonstrates successful walking behaviour by learning through trial and error, without any prior knowledge of itself or the world dynamics. The autonomous walking of the BWR is achieved using the reinforcement learning algorithm Deep Deterministic Policy Gradient (DDPG), one of the algorithms for learning controls in continuous action spaces. After training the model in simulation, it was observed that, with a properly shaped reward function, the robot achieved faster walking and even rendered a running gait with an average speed of 0.83 m/s. The gait pattern of the bipedal walker was compared with an actual human walking pattern, and the results show that the two had similar characteristics. The video presenting our experiment is available at this https URL.
Reinforcement learning provides a general framework for achieving autonomy and diversity in traditional robot motion control. Robots must walk dynamically to adapt to varied ground conditions in complex environments. To achieve walking ability similar to that of humans, robots must be able to perceive, understand, and interact with their surroundings. In 3D environments, walking like a human on rugged terrain is a challenging task because it requires complex world-model generation, motion planning, and control algorithms, as well as their integration; learning high-dimensional complex motions therefore remains an active research topic. This paper proposes a deep reinforcement learning-based footstep tracking method, which tracks the robot's footstep positions by adding periodicity and symmetry information of bipedal walking to the reward function. The robot can achieve obstacle avoidance as well as omnidirectional walking, turning, standing, and stair climbing in complex environments. Experimental results show that reinforcement learning can be combined with real-time robot footstep planning, avoiding the learning of path-planning information during model training; this keeps the model from learning unnecessary knowledge and thereby accelerates the training process.
Achieving stability and robustness is the primary goal of biped locomotion control. Recently, deep reinforcement learning (DRL) has attracted great attention as a general methodology for constructing biped control policies and demonstrated significant improvements over the previous state-of-the-art control methods. Although deep control policies are more advantageous compared with previous controller design approaches, many questions remain: Are deep control policies as robust as human walking? Does simulated walking involve strategies similar to human walking for maintaining balance? Does a particular gait pattern affect human and simulated walking similarly? What do deep policies learn to achieve improved gait stability? The goal of this study is to address these questions by evaluating the push-recovery stability of deep policies compared with those of human subjects and a previous feedback controller. Furthermore, we conducted experiments to evaluate the effectiveness of variants of DRL algorithms.
Learning locomotion skills is a challenging problem. To generate realistic and smooth locomotion, existing methods use motion capture, finite state machines or morphology-specific knowledge to guide the motion generation algorithms. Deep reinforcement learning (DRL) is a promising approach for the automatic creation of locomotion control. Indeed, a standard benchmark for DRL is to automatically create a running controller for a biped character from a simple reward function [Duan et al. 2016]. Although several different DRL algorithms can successfully create a running controller, the resulting motions usually look nothing like a real runner. This paper takes a minimalist learning approach to the locomotion problem, without the use of motion examples, finite state machines, or morphology-specific knowledge. We introduce two modifications to the DRL approach that, when used together, produce locomotion behaviors that are symmetric, low-energy, and much closer to that of a real person. First, we introduce a new term to the loss function (not the reward function) that encourages symmetric actions. Second, we introduce a new curriculum learning method that provides modulated physical assistance to help the character with left/right balance and forward movement. The algorithm automatically computes appropriate assistance to the character and gradually relaxes this assistance, so that eventually the character learns to move entirely without help. Because our method does not make use of motion capture data, it can be applied to a variety of character morphologies. We demonstrate locomotion controllers for the lower half of a biped, a full humanoid, a quadruped, and a hexapod. Our results show that learned policies are able to produce symmetric, low-energy gaits. In addition, speed-appropriate gait patterns emerge without any guidance from motion examples or contact planning.
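A minimal sketch of the first modification described above, the symmetry term added to the loss function (not the reward); `mirror_state` and `mirror_action` stand for the robot-specific left/right index-and-sign maps, which are assumed given rather than taken from the paper.

```python
import torch

def mirror_symmetry_loss(policy, states, mirror_state, mirror_action):
    """Auxiliary loss term: the action for a left/right-mirrored state
    should be the mirror of the action for the original state."""
    a = policy(states)
    a_mirror = policy(mirror_state(states))
    return torch.mean((a - mirror_action(a_mirror)) ** 2)

# Illustrative usage (the weight 4.0 is an arbitrary example):
#   total_loss = policy_loss + 4.0 * mirror_symmetry_loss(
#       policy, states, mirror_state, mirror_action)
```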
Dynamic platforms that operate over many unique terrain conditions typically require many behaviours. To transition safely, there must be an overlap of states between adjacent controllers. We develop a novel method for training setup policies that bridge the trajectories between pre-trained Deep Reinforcement Learning (DRL) policies. We demonstrate our method with a simulated biped traversing a difficult jump terrain, where a single policy fails to learn the task, and switching between pre-trained policies without setup policies also fails. We perform an ablation of key components of our system, and show that our method outperforms others that learn transition policies. We demonstrate our method with several difficult and diverse terrain types, and show that we can use setup policies as part of a modular control suite to successfully traverse a sequence of complex terrains. We show that using setup policies improves the success rate for traversing a single difficult jump terrain (from 51.3% with the best comparative method to 82.2%), and for traversing a random sequence of difficult obstacles (from 1.9% without setup policies to 71.2%).
No abstract available
No abstract available
No abstract available
No abstract available
This study proposes a deep reinforcement learning control strategy using the Twin Delayed Deep Deterministic (TD3) algorithm for the robust locomotion of a point-feet underactuated bipedal robot. We introduce two key contributions: a specialized balance recovery system and a bio-inspired reward function. The balance recovery system is explicitly trained to handle off-balance and fall-like conditions. Its effectiveness was validated through 50 randomized trials, where it achieved a 74% success rate in stabilizing the robot from a wide range of initial heights, velocities, and configurations. The bio-inspired reward function encourages the robot's hip to remain between its feet, which was shown to significantly improve gait stability. This reward shaping reduced the normalized fluctuation in joint angle movements by a factor of 1.75, even under external disturbances. The final controller produced an average running speed of 2.4 m/s and demonstrated robustness to external disturbances of up to ±60 Nm, paving the way for more resilient and adaptive bipedal locomotion.
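An illustrative take on the "hip between the feet" shaping term described above, reduced to one horizontal axis; the functional form and the `scale` constant are assumptions, not the paper's exact reward.

```python
import numpy as np

def bio_inspired_reward(hip_x, left_foot_x, right_foot_x, scale=5.0):
    """Maximal when the hip projects inside the support interval
    spanned by the feet, decaying smoothly outside it."""
    lo, hi = min(left_foot_x, right_foot_x), max(left_foot_x, right_foot_x)
    if lo <= hip_x <= hi:
        return 1.0
    dist = min(abs(hip_x - lo), abs(hip_x - hi))
    return float(np.exp(-scale * dist))
```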
Model-free reinforcement learning is a promising approach for autonomously solving challenging robotics control problems, but faces exploration difficulty without information about the robot’s morphology. The under-exploration of multiple modalities with symmetric states leads to behaviors that are often unnatural and sub-optimal. This issue becomes particularly pronounced in the context of robotic systems with morphological symmetries, such as legged robots for which the resulting asymmetric and aperiodic behaviors compromise performance, robustness, and transferability to real hardware. To mitigate this challenge, we can leverage symmetry to guide and improve the exploration in policy learning via equivariance / invariance constraints. We investigate the efficacy of two approaches to incorporate symmetry: modifying the network architectures to be strictly equivariant / invariant, and leveraging data augmentation to approximate equivariant / invariant actor-critics. We implement the methods on challenging loco-manipulation and bipedal locomotion tasks and compare with an unconstrained baseline. We find that the strictly equivariant policy consistently outperforms other methods in sample efficiency and task performance in simulation. Additionally, symmetry-incorporated approaches exhibit better gait quality, higher robustness and can be deployed zero-shot to hardware.
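A sketch of the data-augmentation variant mentioned above: every stored transition is duplicated under the morphology's mirror symmetry. The `mirror_obs` / `mirror_act` maps are assumed, robot-specific functions, not code from the paper.

```python
def augment_with_mirror(batch, mirror_obs, mirror_act):
    """Approximate equivariance via data augmentation: duplicate each
    transition under the robot's left/right symmetry."""
    aug = []
    for (s, a, r, s2, done) in batch:
        aug.append((s, a, r, s2, done))
        # Reward and termination flag are invariant under the symmetry
        # group, so they are copied unchanged.
        aug.append((mirror_obs(s), mirror_act(a), r, mirror_obs(s2), done))
    return aug
```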
Developing robust locomotion controllers for bipedal robots with closed kinematic chains presents unique challenges, particularly since most reinforcement learning (RL) approaches simplify these parallel mechanisms into serial models during training. We demonstrate that this simplification significantly impairs sim-to-real transfer by failing to capture essential aspects such as joint coupling, friction dynamics, and motor-space control characteristics. In this work, we present an RL framework that explicitly incorporates closed-chain dynamics and validate it on our custom-built robot TopA. Our approach enhances policy robustness through symmetry-aware loss functions, adversarial training, and targeted network regularization. Experimental results demonstrate that our integrated approach achieves stable locomotion across diverse terrains, significantly outperforming methods based on simplified kinematic models.
No abstract available
Achieving stable and natural locomotion in bipedal robots, comparable to that of humans and animals, remains a long-standing challenge in robotics. In this work, we propose a bio-inspired low-level control framework that streamlines the generation of naturalistic gait patterns while ensuring adaptability. Our approach begins with the design of a low-dimensional gait representation that captures key characteristics of human and animal locomotion. This representation is then integrated with the Linear Inverted Pendulum Model (LIPM) to form an abstract yet effective motion descriptor. Serving as a kinematic reference within a reinforcement learning (RL) framework, this descriptor enables the training of control policies that strike a balance between biomechanical realism and adaptability. Rather than strictly adhering to predefined gait trajectories, the learned policies dynamically adjust to optimize both stability and velocity tracking. As a result, our method enables bipedal robots to exhibit smooth, biomechanically realistic locomotion while enhancing stability and adaptability. We validate the proposed framework through real-world experiments on our bipedal robot, demonstrating its ability to achieve stable and efficient locomotion.
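For reference, a closed-form rollout of the Linear Inverted Pendulum Model (LIPM) that such motion descriptors build on; the CoM height, timestep, and horizon are illustrative parameters, not values from the paper.

```python
import numpy as np

def lipm_com_trajectory(x0, v0, foot_x, z_com=0.8, g=9.81, dt=0.01, steps=100):
    """Closed-form LIPM rollout for one support phase: the CoM obeys
    x_ddot = (g / z_com) * (x - foot_x), i.e. it accelerates away from
    the stance foot. Returns position and velocity time series."""
    omega = np.sqrt(g / z_com)
    t = np.arange(steps) * dt
    rel = x0 - foot_x
    x = foot_x + rel * np.cosh(omega * t) + (v0 / omega) * np.sinh(omega * t)
    v = rel * omega * np.sinh(omega * t) + v0 * np.cosh(omega * t)
    return x, v
```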
Model predictive control (MPC) has demonstrated effectiveness for humanoid bipedal locomotion; however, its applicability in challenging environments, such as rough and slippery terrain, is limited by the difficulty of modeling terrain interactions. In contrast, reinforcement learning (RL) has achieved notable success in training robust locomotion policies over diverse terrain, yet it lacks guarantees of constraint satisfaction and often requires substantial reward shaping. Recent efforts in combining MPC and RL have shown promise of taking the best of both worlds, but they are primarily restricted to flat terrain or quadrupedal robots. In this work, we propose an RL-augmented MPC framework tailored for bipedal locomotion over rough and slippery terrain. Our method parametrizes three key components of single-rigid-body-dynamics-based MPC: system dynamics, swing leg controller, and gait frequency. We validate our approach through bipedal robot simulations in NVIDIA IsaacLab across various terrains, including stairs, stepping stones, and low-friction surfaces. Experimental results demonstrate that our RL-augmented MPC framework produces significantly more adaptive and robust behaviors compared to baseline MPC and RL.
This paper introduces a Reinforcement Learning (RL) approach for training a computer simulation of a bipedal lower-limb exoskeleton to walk, without explicit supervision. A simulation of a bipedal exoskeleton with six joints (left and right hip, knee, and ankle) was developed, including an implementation of the Proximal Policy Optimization algorithm for training the RL agent. The system was implemented using Python and the OpenAI Gym library, which provided an environment to simulate interactions and learn locomotion dynamics. The RL-trained agent is capable of learning stable locomotion by interacting with the simulated environment and a complex reward system. The results demonstrate the potential of RL for adaptive control of exoskeletons and serve as a foundation for further research in exoskeleton control and training.
Humanoid robots possess an anthropomorphic structure that makes them particularly suitable for performing diverse tasks in human-centered environments. However, this biomimetic design presents significant challenges in motion controller development, especially when operating in complex real-world conditions. Existing humanoid robots typically employ either model-based control or model-free reinforcement learning (RL) for locomotion on simple terrains. In this work, we focus on reference state generation for bipedal running, proposing an end-to-end RL framework for human-like motion control. Our approach successfully achieves zero-shot sim-to-real transfer of learned running skills, enabling physical robots to maintain stable running at 2 m/s on both flat and uneven terrain. This demonstrates the strong robustness and generalization capability of our reference state generator.
Learning human-like, robust bipedal walking remains difficult due to hybrid dynamics and terrain variability. We propose a lightweight framework that combines a gait generator network learned from human motion with a Proximal Policy Optimization (PPO) controller for torque control. Despite being trained only on flat or mildly sloped ground, the learned policies generalize to steeper ramps and rough surfaces. The results suggest that pairing spectral motion priors with Deep Reinforcement Learning (DRL) offers a practical path toward natural and robust bipedal locomotion at modest training cost.
In recent years, significant progress has been made in the prototype design and control methodologies of modular snake robots. However, there is still relatively little research on the potential enabled by active morphological transformation. This paper presents a novel modular snake robot capable of morphing into a bipedal configuration. The robot, ZBOT, is composed of independent, homogeneous unit modules (named ZBot) connected in series. Each ZBot module has a dual-motor-driven 1-DoF rotational joint, which can rotate continuously, provide a large output torque, and eliminate backlash. There are four connection orientations between adjacent modules. This paper proposes an articulation configuration that enables the snake robot to achieve active transformation from a snake form to a bipedal form. Meanwhile, through reinforcement learning (RL), movements including the stand-up gait are trained and verified in the IsaacSim/Lab simulation environment. This research will advance snake robots beyond surface-dependent locomotion, endowing them with more possibilities and unlocking greater potential for versatile applications.
Standard alternating leg motions serve as the foundation for simple bipedal gaits, and the effectiveness of a fixed stimulus signal has been demonstrated in recent studies. However, to address perturbations and imbalances, robots require more dynamic gaits. In this paper, we introduce dynamic stimulus signals together with a bipedal locomotion policy into reinforcement learning (RL). Through the learned stimulus frequency policy, we induce the bipedal robot to obtain both three-dimensional (3D) locomotion and an adaptive gait under disturbance, without relying on an explicit, model-based gait in either the training stage or deployment. In addition, a set of specialized reward functions focusing on reliable frequency reflections is used in our framework to ensure correspondence between locomotion features and the dynamic stimulus. Moreover, we demonstrate efficient sim-to-real transfer, making a bipedal robot called BITeno achieve robust locomotion and disturbance resistance even in extreme situations of foot sliding in the real world. Specifically, under a sudden change in torso velocity of −1.2 m/s within 0.65 s, the recovery time is within 1.5–2.0 s.
In this study, we introduce an innovative gait learning methodology for three-dimensional bipedal robots, integrating Hybrid Zero Dynamics (HZD) priors with periodic reward functions to enhance gait stability, symmetry, and smooth action transitions. Notably, we employ a data-driven Bezier curve parameterization technique optimized through Reinforcement Learning (RL) to significantly improve learning efficiency and dynamic stability of the gait. The effectiveness of our approach is systematically validated across three key metrics: lateral deviation, training speed, and robustness in challenging environments. The results demonstrate our method's superiority in maintaining path stability, accelerating the learning process, and adapting to complex terrains over traditional gait learning approaches. This research not only advances the efficiency and stability of gait learning for three-dimensional bipedal robots but also provides valuable insights for future studies on robotic gait optimization.
Intelligent walking control is a key research direction for humanoid robots, but traditional motion models constrain the center of mass so strongly in favor of stability that robots struggle to maintain stable walking and imitate human gait at the same time. Compared with traditional deep reinforcement learning algorithms, the DDPG algorithm can handle continuous action spaces and high-dimensional state spaces, and has broad application prospects for many practical problems. We propose a bipedal robot control method based on reward-function optimization that enables the robot to achieve both stable high-speed movement and human gait imitation. This paper combines the physical characteristics of the bipedal robot to establish a control system based on the DDPG algorithm. At the same time, the reward function is designed to guide the robot to learn the correct walking strategy. Comparative tests establish the weight ratios of the reward terms that allow the bipedal robot to increase its speed stably. The simulation results show that the proposed method is practical and effective.
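A hedged sketch of the weighted-reward idea: one term per objective with tunable weights. The term names, `state` keys, and weight values are invented for illustration and are not the paper's reported ratios.

```python
def walking_reward(state, weights=None):
    """Weighted sum of per-objective reward terms. Weights trade off
    speed against stability and gait imitation; the energy term is
    penalized via a negative weight."""
    if weights is None:
        weights = {"velocity": 1.0, "upright": 0.5,
                   "gait_imitation": 0.3, "energy": -0.1}
    terms = {
        "velocity": state["forward_velocity"],
        "upright": -abs(state["torso_pitch"]),
        "gait_imitation": -state["gait_tracking_error"],
        "energy": state["joint_torque_norm"],
    }
    return sum(weights[k] * terms[k] for k in weights)
```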
Generating multimodal locomotion in underactuated bipedal robots requires control solutions that can facilitate motion patterns for drastically different dynamical modes, which is an extremely challenging problem in locomotion-learning tasks. Also, in such multimodal locomotion, utilizing body morphology is important because it leads to energy-efficient locomotion. This study provides a framework that reproduces multimodal bipedal locomotion using passive dynamics through deep reinforcement learning (DRL). An underactuated bipedal model was developed based on a passive walker, and a controller was designed using DRL. By carefully planning the weight parameter settings of the DRL reward function during the learning process based on a curriculum learning method, the bipedal model successfully learned to walk, run, and perform gait transitions by adjusting only one command input. These results indicate that DRL can be applied to generate various gaits with the effective use of passive dynamics.
No abstract available
Although conventional gait pattern control in humanoid robots is typically performed on flat terrain, the roads people walk on every day have bumps and potholes. Therefore, to make humanoid robots more similar to humans, their movement parameters should be modified to allow adaptation to uneven terrain. In this study, reinforcement learning (RL) was used to allow humanoid robots to engage in self-training and automatically adjust their parameters for gait pattern control. However, RL comes in multiple variants, each with its own benefits and shortcomings. A series of experiments indicated that proximal policy optimization (PPO), which combines advantage actor-critic and trust region policy optimization, was the most suitable method. Hence, an improved version of PPO, called PPO2, was used, and the experimental results indicated that combining deep RL with data preprocessing methods, such as wavelet transform and fuzzification, facilitated the gait pattern control and balance of humanoid robots.
In this paper, we propose a novel Reinforcement Learning (RL) algorithm for robotic motion control, that is, a constrained Deep Deterministic Policy Gradient (DDPG) deviation learning strategy to assist biped robots in walking safely and accurately. The previous research on this topic highlighted the limitations in the controller’s ability to accurately track foot placement on discrete terrains and the lack of consideration for safety concerns. In this study, we address these challenges by focusing on ensuring the overall system’s safety. To begin with, we tackle the inverse kinematics problem by introducing constraints to the damping least squares method. This enhancement not only addresses singularity issues but also guarantees safe ranges for joint angles, thus ensuring the stability and reliability of the system. Based on this, we propose the adoption of the constrained DDPG method to correct controller deviations. In constrained DDPG, we incorporate a constraint layer into the Actor network, incorporating joint deviations as state inputs. By conducting offline training within the range of safe angles, it serves as a deviation corrector. Lastly, we validate the effectiveness of our proposed approach by conducting dynamic simulations using the CRANE biped robot. Through comprehensive assessments, including singularity analysis, constraint effectiveness evaluation, and walking experiments on discrete terrains, we demonstrate the superiority and practicality of our approach in enhancing walking performance while ensuring safety. Overall, our research contributes to the advancement of biped robot locomotion by addressing gait optimisation from multiple perspectives, including singularity handling, safety constraints, and deviation learning.
This paper proposes an online bipedal footstep planning strategy that combines model predictive control (MPC) and reinforcement learning (RL) to achieve agile and robust bipedal maneuvers. While MPC-based foot placement controllers have demonstrated their effectiveness in achieving dynamic locomotion, their performance is often limited by the use of simplified models and assumptions. To address this challenge, we develop a novel foot placement controller that leverages a learned policy to bridge the gap between the use of a simplified model and the more complex full-order robot system. Specifically, our approach employs a unique combination of an Angular Momentum Linear Inverted Pendulum (ALIP)-based MPC foot placement controller for sub-optimal footstep planning and the learned policy for refining footstep adjustments, enabling the resulting footstep policy to capture the robot’s whole-body dynamics effectively. This integration synergizes the predictive capability of MPC with the flexibility and adaptability of RL. We validate the effectiveness of our framework through a series of experiments using the full-body humanoid robot DRACO 3. The results demonstrate significant improvements in dynamic locomotion performance, including better tracking of a wide range of walking speeds, enabling reliable turning and traversing challenging terrains while preserving the robustness and stability of the walking gaits compared to the baseline ALIP-based MPC approach.
This research develops and implements a novel reinforcement learning (RL) architecture to address the trajectory-tracking problem in bipedal robotic systems under articulated-joint constraints. The proposed RL framework extends previously designed adaptive controllers characterized by state-dependent gain structures. The learning mechanism comprises two hierarchical adaptation layers: the first employs an adaptive dynamic programming (ADP) formulation to approximate the Bellman value function using a class of continuous-time dynamic neural networks. In contrast, the second uses an iterative optimization scheme based on the deep deterministic policy gradient (DDPG) algorithm. The resulting control strategy minimizes a robust performance index defined over the tracking trajectories of a system with uncertain and nonlinear dynamics representative of bipedal locomotion. The dynamic programming formulation ensures robustness to bounded parametric uncertainties and external perturbations. By approximating the Hamilton–Jacobi–Bellman (HJB) value function using neural network structures, a closed-loop controller design is systematically established. Numerical simulations demonstrate the convergence of the tracking error to a region centered at the origin with a size that depends on the approximation quality of the selected neural network. To assess the effectiveness of the proposed approach, a conventional state-feedback control design is adopted as a benchmark, revealing that the suggested method produces a lower cumulative tracking error norm (0.023 vs. 0.037 rad·s) in the trajectory-tracking control problem for all robotic joints while simultaneously reducing the control effort required to complete motion tasks.
This work introduces a hierarchical strategy for terrain-aware bipedal locomotion that integrates reduced-dimensional perceptual representations to enhance reinforcement learning (RL)-based high-level (HL) policies for real-time gait generation. Unlike end-to-end approaches, our framework leverages latent terrain encodings via a Convolutional Variational Autoencoder (CNN-VAE) alongside reduced-order robot dynamics, optimizing the locomotion decision process with a compact state. We systematically analyze the impact of latent space dimensionality on learning efficiency and policy robustness. Additionally, we extend our method to be history-aware, incorporating sequences of recent terrain observations into the latent representation to improve robustness. To address real-world feasibility, we introduce a distillation method to learn the latent representation directly from depth camera images and provide preliminary hardware validation by comparing simulated and real sensor data. We further validate our framework using the high-fidelity Agility Robotics (AR) simulator, incorporating realistic sensor noise, state estimation, and actuator dynamics. The results confirm the robustness and adaptability of our method, underscoring its potential for hardware deployment.
Biped robots have many benefits over wheeled, quadruped, or hexapod robots due to their ability to behave like human beings in rough, non-flat environments. Deformable terrain is a further challenge for biped robots, which must cope with sinkage and maintain stability without falling. In this study, we propose a Deep Deterministic Policy Gradient (DDPG) approach for motion control of a flat-foot biped robot walking on deformable terrain. We consider a 7-link biped robot for our proposed approach. For soft-soil terrain modeling, we use a triangular mesh to describe the geometry, where the mesh parameters determine the softness of the soil. All simulations were performed in PyChrono, which can handle soft-soil environments.
No abstract available
In this brief, a parallel Deep Deterministic Policy Gradient (DDPG) algorithm is presented for biped robot gait control. Biped robot gait control is a high-dimensional continuous problem, and it is challenging to obtain a fast and stable gait. Traditional methods cannot fully utilize the autonomous exploration capability of a biped robot. A multiple Actor-Critic (AC) network is established to expand the scope of exploration and improve training efficiency. To optimize the experience replay mechanism, an experience filtering unit is introduced, and a cosine similarity method is used to classify experience. Then, a Markov Decision Process (MDP) model based on knowledge and experience is designed to solve the problem of sparse rewards. Finally, experimental results show that the parallel DDPG algorithm makes the biped robot walk more quickly and stably, reaching a speed of 0.62 m/s.
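A guess at how a cosine-similarity experience filter might look; the feature encoding (concatenated state and action) and the threshold are assumptions rather than the paper's settings.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def filter_experience(new_transition, stored_features, threshold=0.95):
    """Keep a new transition only if its feature vector is not nearly
    parallel to one already in the buffer, so replay memory stays
    diverse instead of filling with near-duplicates."""
    feat = np.concatenate([new_transition["state"], new_transition["action"]])
    for f in stored_features:
        if cosine_similarity(feat, f) > threshold:
            return False  # too similar to stored experience; discard
    return True
```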
Reinforcement learning (RL) has shown promise in generating robust locomotion policies for bipedal robots, but often suffers from tedious reward design and sensitivity to poorly shaped objectives. In this work, we propose a structured reward shaping framework that leverages model-based trajectory generation and control Lyapunov functions (CLFs) to guide policy learning. We explore two model-based planners for generating reference trajectories: a reduced-order linear inverted pendulum (LIP) model for velocity-conditioned motion planning, and a precomputed gait library based on hybrid zero dynamics (HZD) using full-order dynamics. These planners define desired end-effector and joint trajectories, which are used to construct CLF-based rewards that penalize tracking error and encourage rapid convergence. This formulation provides meaningful intermediate rewards, and is straightforward to implement once a reference is available. Both the reference trajectories and CLF shaping are used only during training, resulting in a lightweight policy at deployment. We validate our method both in simulation and through extensive real-world experiments on a Unitree G1 robot. CLF-RL demonstrates significantly improved robustness relative to the baseline RL policy and better performance than a classic tracking reward RL formulation.
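A minimal sketch of a CLF-shaped reward, assuming a simple quadratic Lyapunov function on the tracking error between the measured and planner-referenced trajectories; the paper's actual CLF construction from its LIP and HZD planners is more elaborate than this.

```python
import numpy as np

def clf_reward(err, err_prev, dt, alpha=2.0, w_v=1.0, w_decr=1.0):
    """Penalize both the CLF value V(e) = ||e||^2 and violation of the
    exponential decrease condition Vdot + alpha * V <= 0, which rewards
    rapid convergence toward the reference trajectory."""
    V = float(np.dot(err, err))
    V_prev = float(np.dot(err_prev, err_prev))
    V_dot = (V - V_prev) / dt                     # finite-difference Vdot
    decrease_violation = max(0.0, V_dot + alpha * V)
    return -w_v * V - w_decr * decrease_violation
```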
Bipedal humanoid robots represent a crucial avenue in robotics development due to their adaptability to human-centric work environments, spanning industrial, service, and rescue sectors. The attainment of robust and energy-efficient bipedal walking in real-world scenarios has persistently challenged both industrial and academic research. Traditional model-based locomotion control is heavily reliant on environmental factors, burdened by intricate modeling complexity, and lacking in generalization capability; reinforcement learning (RL) offers a way to advance adaptive locomotion control without such complex modeling processes. This paper introduces a newly developed full-scale bipedal humanoid robot named Xiao-Man. An RL-based actor-critic network is designed to facilitate the robot's terrain-adaptive and efficient walking behavior. The control policy training process incorporates task rewards and auxiliary rewards to achieve robust and energy-efficient bipedal walking. To support this, we have curated a dataset based on joint actuation ground-truth data and trained a joint actuator network to bridge the gap between expected torque and actual response torque. The results demonstrate that our trained control policy empowers the bipedal humanoid robot to achieve robust, energy-efficient bipedal walking and adaptability to complex terrains using solely proprioceptive information.
Bipedal humanoid robots have the ability to both move and manipulate in complex environments, which is of great significance for the future. However, stable bipedal walking in the real world has always been a challenge in industry and even in academia. Traditional model-based methods are highly dependent on the environment, with high modeling complexity and a lack of generalization; solutions based on simplified models usually leave the control algorithms unable to adapt to complex terrain. This paper presents a newly designed bipedal humanoid robot, Xiao-Man. To achieve terrain-adaptive walking behavior, a reinforcement-learning-based Actor-Critic network with an asymmetric structure is proposed. Without using any external perception information, robust bipedal walking behavior of Xiao-Man is achieved. In the process, we also build a dataset based on joint actuation ground-truth data and train a joint actuator network to reduce the gap between the expected torque and the actual response torque. Experimental results show that the bipedal humanoid robot equipped with the trained control policy achieves stable walking and disturbance rejection relying only on proprioceptive information.
In recent years, humanoid robots have garnered significant attention from both academia and industry due to their high adaptability to environments and human-like characteristics. With the rapid advancement of reinforcement learning, substantial progress has been made in the walking control of humanoid robots. However, existing methods still face challenges when dealing with complex environments and irregular terrains. In the field of perceptive locomotion, existing approaches are generally divided into two-stage methods and end-to-end methods. Two-stage methods first train a teacher policy in a simulated environment and then use distillation techniques, such as DAgger, to transfer the privileged information learned as latent features or actions to the student policy. End-to-end methods, on the other hand, forgo the learning of privileged information and directly learn policies from a partially observable Markov decision process (POMDP) through reinforcement learning. However, due to the lack of supervision from a teacher policy, end-to-end methods often face difficulties in training and exhibit unstable performance in real-world applications. This paper proposes an innovative two-stage perceptive locomotion framework that combines the advantages of teacher policies learned in a fully observable Markov decision process (MDP) to regularize and supervise the student policy. At the same time, it leverages the characteristics of reinforcement learning to ensure that the student policy can continue to learn in a POMDP, thereby enhancing the model’s upper bound. Our experimental results demonstrate that our two-stage training framework achieves higher training efficiency and stability in simulated environments, while also exhibiting better robustness and generalization capabilities in real-world applications.
Recent advancements in reinforcement learning (RL) have led to significant progress in humanoid robot locomotion, simplifying the design and training of motion policies in simulation. However, the numerous implementation details make transferring these policies to real-world robots a challenging task. To address this, we have developed a comprehensive code framework that covers the entire process from training to deployment, incorporating common RL training methods, domain randomization, reward function design, and solutions for handling parallel structures. This library is made available as a community resource, with detailed descriptions of its design and experimental results. We validate the framework on the Booster T1 robot, demonstrating that the trained policies seamlessly transfer to the physical platform, enabling capabilities such as omnidirectional walking, disturbance resistance, and terrain adaptability. We hope this work provides a convenient tool for the robotics community, accelerating the development of humanoid robots. The code can be found at https://github.com/BoosterRobotics/booster_gym.
Safe and real-time navigation is fundamental for humanoid robot applications. However, existing bipedal robot navigation frameworks often struggle to balance computational efficiency with the precision required for stable locomotion. We propose a novel hierarchical framework that continuously generates dynamic subgoals to guide the robot through cluttered environments. Our method comprises a high-level reinforcement learning (RL) planner for subgoal selection in a robot-centric coordinate system and a low-level Model Predictive Control (MPC) based planner which produces robust walking gaits to reach these subgoals. To expedite and stabilize the training process, we incorporate a data bootstrapping technique that leverages a model-based navigation approach to generate a diverse, informative dataset. We validate our method in simulation using the Agility Robotics Digit humanoid across multiple scenarios with random obstacles. Results show that our framework significantly improves navigation success rates and adaptability compared to both the original model-based method and other learning-based methods.
Achieving precise gait planning and high robustness in locomotion control is crucial for the development and application of humanoid robots. In this paper, a novel control strategy is proposed, which combines dynamics control and reinforcement learning (RL), leveraging the precision of dynamics control and the robustness of RL. Specifically, foot placements for each step of the humanoid robot are designed, and the trajectories of the center of mass (CoM) and feet are obtained using a 3D linear inverted pendulum model (3D LIPM). Subsequently, joint angles during motion are calculated based on the trajectories of the CoM and feet using inverse kinematics equations. Finally, the obtained joint angles are trained as baseline actions using RL algorithms. To enhance control robustness, parameter domain randomization is introduced during the training process. By employing this control strategy, simulations of various single-step gaits, such as walking forward, walking to the right, and making right turns, are achieved. Additionally, trajectory tracking, locomotion tests on different terrains, and disturbance resistance are conducted. The simulation results demonstrate that the proposed control strategy enables precise gait control and exhibits strong robustness in humanoid robots.
Stable locomotion of humanoid robots in complex and unstructured terrains remains a significant challenge, primarily due to their intricate and highly coupled dynamics, as well as the inherent sim-to-real gap that complicates direct transfer of simulated policies to physical robots. This paper introduces a novel Motor-Guided Randomization Reinforcement Learning (MGRL) framework, specifically designed to substantially improve both the robustness and environmental adaptability of bipedal locomotion. At its core, MGRL incorporates an adaptive noise perturbation during reinforcement learning training. This perturbation, dynamically adjusted based on the robot's motor output torque, effectively simulates the inevitable uncertainties and non-idealities present in real-world motor performance, such as friction, backlash, and varying efficiency. By systematically exposing the policy to these physics-level non-idealities during training, MGRL significantly enhances the robot's walking stability, disturbance rejection capabilities, and overall task success rate across a wide array of challenging terrains, including steep slopes, irregular steps, and highly uneven surfaces. Furthermore, this principled approach not only mitigates issues arising from discrepancies between simulated and real-world physics but also prevents the policy from over-adapting to overly idealized or unrealistic disturbances, leading to more transferable and reliable behaviors.
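An illustrative version of torque-guided noise injection in the spirit of MGRL: the perturbation applied to the commanded torque grows with the motor's output torque, mimicking load-dependent non-idealities. The noise model and coefficients are assumptions, not the paper's formulation.

```python
import numpy as np

def motor_guided_torque_noise(tau, base_std=0.01, gain=0.05, rng=None):
    """Add zero-mean Gaussian noise whose standard deviation scales
    with |tau|, so high-load motions see larger simulated friction,
    backlash, and efficiency effects during training."""
    rng = rng or np.random.default_rng()
    std = base_std + gain * np.abs(tau)
    return tau + rng.normal(0.0, std)
```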
In this paper, we describe a model-free reinforcement learning method for gait control of humanoid robots, which combines Q-learning with a Radial Basis Function (RBF) network. With the help of the RBF network, this method can solve the approximation problem caused by continuous state and action spaces. The approach is applied to controllers on the hip joints of humanoid robots that receive sensory data and constantly adjust the outputs of the hip-joint servo motors, finding an optimal policy that can guide humanoid robots to walk stably on different uneven terrains. We have tested the approach on Webots, a simulation platform, and the experimental results prove the validity of the proposed method.
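A minimal RBF-network Q-approximator of the kind described: Q(s, a) is a linear combination of Gaussian features of the concatenated state-action vector, so continuous states and actions can be handled. The centers, width, and learning rate are illustrative.

```python
import numpy as np

class RBFQApproximator:
    """Q-function as a weighted sum of Gaussian radial basis features."""
    def __init__(self, centers, sigma=0.5, lr=0.05):
        self.centers = np.asarray(centers)  # (n_basis, dim of [s, a])
        self.sigma = sigma
        self.w = np.zeros(len(self.centers))
        self.lr = lr

    def features(self, sa):
        d2 = np.sum((self.centers - sa) ** 2, axis=1)
        return np.exp(-d2 / (2 * self.sigma ** 2))

    def q(self, sa):
        return float(self.features(sa) @ self.w)

    def update(self, sa, target):
        # TD-style update: move Q(s, a) toward the bootstrapped target.
        phi = self.features(sa)
        self.w += self.lr * (target - phi @ self.w) * phi
```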
The humanoid gait generation task for biped robots has been a lasting challenge. To achieve a stable gait in such a complex kinematic system, and with improvements in computing power and algorithms, machine learning methods such as reinforcement learning (RL) are widely applied to this topic and have made progressive results. RL provides a convenient end-to-end solution and has become the mainstream machine learning method in the field of robotics. However, problems remain when using traditional RL methods, in terms of similarity to natural human gait, training time, robustness to perturbations, and generalization to other tasks. In this paper, we propose an improved framework based on the feedforward enhanced reinforcement learning (FERL) algorithm to efficiently generate a humanoid gait for a small humanoid robot, the KHR-3HV. In FERL, the control action of the biped robot consists of two parts: a reference part and a weighted RL part. We introduce the prior knowledge that human walking gait exhibits sinusoidal characteristics and design the reference actions by inverse kinematic analysis based on this principle. The weighted RL part is set independently to motivate the biped robot to complete the target task. Providing a reference control-signal sequence helps generate an ideal gait, and the action space of the weighted RL part is smaller than in traditional RL, so the training time is reduced while certain robustness is maintained. By setting up 4 groups of control experiments in a simulation environment, we efficiently generated an effective humanoid gait that completes the target tasks of walking on flat ground and climbing stairs, and showed that the proposed method performs well in gait biomimetic similarity, training efficiency, robustness, and generalization to other tasks with simple reward-function design.
This research focused on whether Deep Reinforcement Learning (Deep RL) can enable a cheap, small humanoid robot to develop complex and secure movement skills that can then be integrated into intricate behavioral strategies in dynamic environments. The humanoid robot, equipped with 20 actuated joints, was trained using Deep RL to participate in a simplified one-versus-one (1v1) soccer match. Initially, we taught the robot separate skills before combining them in a self-play scenario. The final policy demonstrated robust and dynamic movement characteristics such as rapid fall recovery, walking, turning, kicking, and more, exceeding the logical expectations for the robot. The robot seamlessly transitioned between these skills in a stable and effective manner. Moreover, the agents developed an understanding of the game’s strategy, including predicting ball trajectories and countering opponent moves. Surprisingly, a limited number of straightforward rewards led to a wide range of behaviors. The agents underwent simulated training before being transferred to real robots. Despite unmodeled effects and differences between robot instances, successful transfer was assisted by a combination of high-frequency control, targeted dynamics randomization, and training with disturbances in simulation. By introducing small hardware modifications and applying basic behavior regularization during training, the robots achieved secure maneuvers while maintaining dynamic and agile performance. Notably, the agents in the experiments demonstrated significant improvements over a programmed baseline, walking 156% faster, standing up 63% faster, and kicking 24% quicker, all while effectively combining their skills to achieve long-term goals.
Research in the field of robotics is complex, especially for humanoid robots. The gait problem is one that several researchers have faced, proposing different models over the years but without a completely accurate result. This thesis therefore proposes a walking method based on machine learning, specifically on reinforcement learning algorithms. The method seeks to make the robot walk as closely as possible to a person's gait, using a database extracted from various trials in order to teach the robot how humans walk.
No abstract available
Using Reinforcement Learning to Improve the Stability of a Humanoid Robot: Walking on Sloped Terrain
No abstract available
Teaching a humanoid robot to walk is an open and challenging problem. Classical walking behaviors usually require the tuning of many control parameters (e.g., step size, speed). Finding an initial or basic configuration of such parameters may not be hard, but optimizing them for some goal (for instance, walking faster) is not easy, because incorrectly defined parameters may cause the humanoid to fall, with consequent damage. In this paper we propose the use of Safe Reinforcement Learning to improve the walking behavior of a humanoid, permitting the robot to walk faster than with a pre-defined configuration. Safe Reinforcement Learning assumes the existence of a safe baseline policy that permits the humanoid to walk, and probabilistically reuses that policy to learn a better one, represented following a case-based approach. The proposed algorithm has been evaluated on a real humanoid robot, showing that it drastically increases learning speed while reducing the number of falls during learning compared with state-of-the-art algorithms.
This paper addresses the issue of stiffness in the upper body and lack of coordination between the upper and lower body during humanoid robot walking. An improved humanoid robot reinforcement learning algorithm incorporating an LSTM framework is proposed to optimize full-body coordinated movement. Based on the Humanoid-Gym framework, a novel reward mechanism is designed, taking into account a detailed evaluation of arm movement and the collaborative control between the arms and thighs. The reinforcement learning model adopts an Actor-Critic architecture, integrating the LSTM framework into the network to enhance feature extraction and dynamic modeling capabilities. Finally, experiments were conducted using the Hi ROBOT humanoid platform to validate the proposed model. The proposed LSTM network is compared with the original network, GRU, CNN, and other networks, demonstrating the superiority of the model: performance improves by approximately 3.4% on the reward metric, and the model reaches the performance level of the original network after 40k training steps rather than 60k, while maintaining a fast convergence rate. Additionally, the optimized algorithm yields better arm swing and leg coordination in the gait, with smoother, more coordinated movement that is closest to the human walking pattern.
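A minimal PyTorch sketch of an Actor-Critic with an LSTM trunk, in the spirit of the architecture described above; the layer sizes and the tanh action squashing are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LSTMActorCritic(nn.Module):
    """Actor-Critic sharing a recurrent trunk: the LSTM state lets the
    policy model temporal coupling, e.g. between arm and leg motion."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.actor = nn.Linear(hidden, act_dim)
        self.critic = nn.Linear(hidden, 1)

    def forward(self, obs_seq, hc=None):
        # obs_seq: (batch, time, obs_dim); hc: optional (h, c) state
        z, hc = self.lstm(obs_seq, hc)
        actions = torch.tanh(self.actor(z))   # bounded joint targets
        values = self.critic(z)               # per-timestep value
        return actions, values, hc
```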
No abstract available
This paper presents IRSE X1, a lightweight humanoid robot designed for robust blind locomotion using only proprioceptive sensing. IRSE X1 stands 1.2 meters tall, weighs 12 kilograms, and features a minimalistic design combining aluminum sheet metal, machined parts, and 3D-printed components. It utilizes an inertial measurement unit (IMU) as its sole sensor, eliminating reliance on exteroceptive devices such as LiDAR or depth cameras. Locomotion policies are trained via reinforcement learning in Isaac Gym and transferred to MuJoCo for high-fidelity validation, achieving stable walking across challenging terrains, including slopes, stairs, and uneven ground. The robot's electrical and mechanical architecture prioritizes simplicity, cost-effectiveness, and maintainability, with all components (except the power board) sourced off-the-shelf. Experimental results demonstrate that IRSE X1 achieves reliable blind walking while significantly lowering weight, system complexity, and development barriers compared to conventional humanoid platforms. This work highlights a pathway toward affordable, robust, and adaptable humanoid robots for diverse real-world environments.
As the application demand for humanoid robots in complex and unstructured environments increases, how to balance adaptability and stability in control strategies has become increasingly critical. This paper compares and analyzes two typical humanoid robot control methods: a reinforcement learning-based controller on the Digit V3 platform and a multi-contact planning and control (MCPC) framework on the COMAN+ platform. This work addresses a gap in the literature that lacks a comparative perspective on cross-method and cross-platform practical verification. It first introduces the design concept and training mechanism of the reinforcement learning controller, which removes reliance on gait clocks and achieves the natural switching between standing and walking under external disturbances. Then it analyzes the MCPC framework, which combines posture sampling, nonlinear programming (NLP) trajectory optimization, and torque-level balance control to support the robot to stably perform complex multi-contact tasks. Experimental results show that the reinforcement learning controller exhibits excellent robustness in disturbance response and command switching, while the MCPC method shows higher accuracy and repeatability in structured tasks. The results of this study show that reinforcement learning is suitable for dealing with scenarios with strong dynamic adaptability, while planning control emphasizes interpretability and physical feasibility. The comparative analysis presented in this article provides a reference for understanding the trade-offs in humanoid robot control strategies and also offers guidance for truly realizing embodied intelligence in humanoid robots in the future.
Full-sized humanoid robot capabilities have grown exponentially in recent years, aiming towards general-purpose deployment in human environments. A popular control method used by manufacturers utilizes Virtual Reality for upper-body teleoperation and Reinforcement Learning for lower-body balance and locomotion control. As a result, a single remote operator can see, manipulate, and navigate about a real, distant physical environment. This powerful control stack is often relegated to expensive full-sized robots, many of which are inaccessible to the research community. Miniature humanoids are more prevalent, but employ less biomimicry in their design (e.g. fewer sensors, Degrees of Freedom, etc.) and lack similar developments. This paper describes a compliant full-body telepresence control stack developed from the ground up for miniature humanoids. Framework experimentation on ROBOTIS OP3 hardware showcases walking at speeds up to 0.45 m/s independent of arm motions. Tele-loco-manipulation is demonstrated via a cube relocation experiment with an expert human operator. On average, the teleoperated system moved 2 different 40 g cubes within 10 minutes, walking a total distance of 5 m. Overall, the developed system shows potential for miniature humanoid tele-loco-manipulation.
Most humanoid robots walk with crouched legs to avoid the singularity problem in articulated robot control. In comparison, humans walk with straight legs. We used a reinforcement learning approach to enable human-like straight-leg walking for humanoid robots. First, we train separate models to control the humanoid robot to walk and to stand using reinforcement learning. Then we train a switching model to achieve smooth transitions between walking and standing. After training in simulation, we deploy the model on a real humanoid robot, bridging the gap between simulation and reality by incorporating visual features into the robot's state estimation. The humanoid robot can thus move stably on complex terrains with straightened legs, including walking in different directions, adapting to external disturbances, and standing still on discrete obstacles and slopes. Straight-leg locomotion is more human-like and reduces motor torques by 23%.
This paper proposes a hierarchical reinforcement learning control method and applies it to the walking control of a humanoid robot. First, the proximal policy optimization (PPO) reinforcement learning algorithm is combined with a central pattern generator (CPG), yielding a united hierarchical reinforcement learning (UHRL) scheme that coordinates high- and low-level control tasks. Second, particle swarm optimization is used to obtain the initial parameter configuration of the CPG, so that the robot can produce a basic walking gait at the start of the experiment; the particle swarm variance fitness serves as a variation constraint to prevent the optimization from premature convergence. Third, the reward function of the high-level controller is designed to keep the humanoid robot from deviating from its original path.
In this paper, a novel reinforcement learning method that enables a humanoid robot to learn bipedal walking using a simple reference motion is proposed. Reinforcement learning has recently emerged as a useful method for robots to learn bipedal walking, but, in many studies, a reference motion is necessary for successful learning, and it is laborious or costly to prepare a reference motion. To overcome this problem, our proposed method uses a simple reference motion consisting of three sine waves and automatically sets the waveform parameters using Bayesian optimization. Thus, the reference motion can easily be prepared with minimal human involvement. Moreover, we introduce two means to facilitate reinforcement learning: (1) we combine reinforcement learning with inverse kinematics (IK), and (2) we use the reference motion as a bias for the action determined via reinforcement learning, rather than as an imitation target. Through numerical experiments, we show that our proposed method enables bipedal walking to be learned based on a small number of samples. Furthermore, we conduct a zero-shot sim-to-real transfer experiment using a domain randomization method and demonstrate that a real humanoid robot, KHR-3HV, can walk with the controller acquired using the proposed method.
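To make the mechanism concrete, here is a minimal sketch (our illustration, not the authors' code) of a three-sine-wave reference motion used as an action bias rather than an imitation target; the waveform parameters are hypothetical placeholders that, in the paper, Bayesian optimization would set automatically:

```python
import numpy as np

def reference_motion(t, params):
    """Simple reference from three sine waves: lateral sway, foot lift,
    and forward foot swing (amplitudes/frequency are assumed values)."""
    a_sway, a_lift, a_step, freq = params
    phase = 2.0 * np.pi * freq * t
    sway = a_sway * np.sin(phase)                          # lateral CoM sway
    lift = a_lift * np.maximum(0.0, np.sin(2.0 * phase))   # foot clearance
    step = a_step * np.sin(phase)                          # forward swing
    return np.array([sway, lift, step])

def biased_action(policy_action, t, params, scale=0.1):
    # The reference biases the learned action instead of serving as an
    # imitation target: the policy only learns a residual around it,
    # and the result is then mapped to joints via inverse kinematics.
    return reference_motion(t, params) + scale * policy_action
```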
Training a humanoid robot to walk is an interesting and difficult problem. Traditional walking controllers usually require tuning numerous control elements (for example, step size and pace). Determining an initial or basic arrangement of these features need not be complex, but optimizing them toward some target (for instance, walking better) becomes tough, because poorly specified features can make a humanoid collapse and thus cause damage. Therefore, in this article, we propose a deep RL approach with a soft actor-critic (SAC) algorithm that uses a reinforcement learning architecture with discrete variables to enhance the walking movements of a humanoid robot so that it can move quickly compared to a pre-defined arrangement. In this approach, the agent strives to increase expected reward as well as entropy. Reliable reinforcement learning presumes the existence of a safe baseline policy that allows the humanoid to walk; as a result, the robot can be taught how to walk using the simulation from our project as input. Finally, we illustrate that, unlike many other off-policy algorithms, our approach is quite consistent, producing comparatively similar results across various random samples. We evaluate SAC's performance against other agents' cumulative rewards per episode as a baseline. Regardless of the state, the random agent samples actions uniformly at random from the action space. Clusters of cumulative reward are found just below 0 and just above -100 after evaluating the random agent for 3000 episodes.
We present Human to Humanoid (H2O), a reinforcement learning (RL) based framework that enables real-time whole-body teleoperation of a full-sized humanoid robot with only an RGB camera. To create a large-scale retargeted motion dataset of human movements for humanoid robots, we propose a scalable "sim-to-data" process to filter and pick feasible motions using a privileged motion imitator. Afterwards, we train a robust real-time humanoid motion imitator in simulation using these refined motions and transfer it to the real humanoid robot in a zero-shot manner. We successfully achieve teleoperation of dynamic whole-body motions in real-world scenarios, including walking, back jumping, kicking, turning, waving, pushing, boxing, etc. To the best of our knowledge, this is the first demonstration to achieve learning-based real-time whole-body humanoid teleoperation.
The development of humanoid robots and improvements in robot-assisted mobility have created a huge need for robots that can learn to walk in challenging environments. Hand-engineered approaches may work in constrained environments, but learning-based approaches can prove superior owing to their greater generalization. In this work, a simple reinforcement learning technique is designed for the biped robot's real-time gait planning. Bipedal walking is a process in which the robot constantly engages with its surroundings, evaluates the effectiveness of its control actions based on how it is moving, and then modifies its control approach accordingly. The continuous state space and action space are obtained using the Actor-Critic Model (ACM) reinforcement learning method, which directly determines the robot's final gait without using a reference gait. To decrease the amount of training on the actual robot, speed up training, and guarantee acquisition of the final gait, the trained model is transferred to the walking control of the actual robot. Finally, a biped robot is designed and constructed to test the viability of the suggested approach. Several experiments demonstrate that the suggested approach can produce the continuous and stable gait planning needed by the biped robot.
Gait generation is essential for a humanoid robot to realize flexible motion adapted to complex environments. With the wide application of data-driven methods, gait generation approaches based on reinforcement learning have been presented. In this study, we propose a multi-speed walking gait generation method based on reinforcement learning and human motion imitation. Multi-speed walking gaits were generated by imitating human walking at only one speed. Moreover, we analyzed multi-speed gait generation for a biped robot with reduced human motion data (e.g., motion data of trunk orientation and hip and knee angles without the ankle angle, or of trunk orientation and hip angle only) and reduced training time. This study provides a novel method for generating multi-speed, human-like walking gaits for biped robots.
In recent years, many countries have increased their investment in the field of humanoid robots, driving significant technological development. This study aims to enable humanoid robots to better adapt to various complex environments, enhancing the robustness of their motion systems and the generalization ability of their motion strategies. When using reinforcement learning algorithms, training on varied terrain is a critical factor for developing adaptable humanoid robots. This paper takes the humanoid robot G1 as the research platform. First, it completes the training, transfer verification, and real-machine deployment of a flat-ground walking model. Then, using fuzzy logic control and a phased training strategy, walking models for ascending/descending stairs and traversing slopes are trained. By systematically varying the stair height and slope gradient, the convergence of the reward function and the task completion success rate are analyzed. Furthermore, the dynamic stability of the robot on complex terrains is validated through qualitative kinematic analysis. The research concludes that as the single-step height and slope gradient increase, the reward value initially rises with more iterations but converges more slowly and to a lower final value. Statistical analysis shows that the success rates of phased training on stair and slope terrains exceed 86% and 92%, respectively.
In this work, we introduce a control framework that combines model-based footstep planning with Reinforcement Learning (RL), leveraging desired footstep patterns derived from the Linear Inverted Pendulum (LIP) dynamics. Utilizing the LIP model, our method forward predicts robot states and determines the desired foot placement given the velocity commands. We then train an RL policy to track the foot placements without following the full reference motions derived from the LIP model. This partial guidance from the physics model allows the RL policy to integrate the predictive capabilities of the physics-informed dynamics and the adaptability characteristics of the RL controller without overfitting the policy to the template model. Our approach is validated on the MIT Humanoid, demonstrating that our policy can achieve stable yet dynamic locomotion for walking and turning. We further validate the adaptability and generalizability of our policy by extending the locomotion task to unseen, uneven terrain. During the hardware deployment, we have achieved forward walking speeds of up to 1.5 m/s on a treadmill and have successfully performed dynamic locomotion maneuvers such as 90-degree and 180-degree turns.
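The LIP-based prediction step can be sketched as below (our illustration under standard LIP assumptions; the capture-point heuristic and feedback gain are ours, not necessarily the paper's exact formulation):

```python
import numpy as np

G = 9.81  # gravity [m/s^2]

def lip_predict(x, xdot, p, h, dt):
    """Forward-integrate the Linear Inverted Pendulum xddot = (g/h)(x - p)
    for a CoM at height h with stance foot at p, using the analytic solution."""
    w = np.sqrt(G / h)                      # pendulum natural frequency
    c, s = np.cosh(w * dt), np.sinh(w * dt)
    x_next = (x - p) * c + (xdot / w) * s + p
    xdot_next = (x - p) * w * s + xdot * c
    return x_next, xdot_next

def desired_foot_placement(x, xdot, v_cmd, h, k=0.1):
    """Pick the next foot placement so the CoM velocity tracks v_cmd:
    a capture-point term plus a velocity-feedback offset (gain k assumed)."""
    w = np.sqrt(G / h)
    icp = x + xdot / w                      # instantaneous capture point
    return icp + k * (xdot - v_cmd)
```

In the framework described above, only these foot placements guide the RL policy; the full LIP reference motion is deliberately not tracked, which is what keeps the policy from overfitting to the template model.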
We introduce Berkeley Humanoid, a reliable and low-cost mid-scale humanoid research platform for learning-based control. Our lightweight, in-house-built robot is designed specifically for learning algorithms, with accurate simulation, low simulation complexity, anthropomorphic motion, and high reliability against falls. The narrow sim-to-real gap enables agile and robust locomotion across various terrains in outdoor environments, achieved with a simple reinforcement learning controller using light domain randomization. Furthermore, we demonstrate the robot traversing for hundreds of meters, walking on a steep unpaved trail, and hopping with single and double legs as a testimony to its high performance in dynamic walking. Capable of omnidirectional locomotion and withstanding large perturbations with a compact setup, our system aims for rapid sim-to-real deployment of learning-based humanoid systems. Please check our website https://berkeley-humanoid.com/ and code https://github.com/HybridRobotics/isaac_berkeley_humanoid/.
Bipedal walking is a challenging task for humanoid robots. In this study, we develop a lightweight reinforcement learning method for real-time gait planning of a biped robot. We regard bipedal walking as a process in which the robot constantly interacts with the environment, judges the quality of control actions through the walking state, and then adjusts the control strategy. A mean-asynchronous advantage actor-critic (M-A3C) reinforcement learning algorithm is proposed to obtain the continuous state space and action space and directly obtain the final gait of the robot without introducing a reference gait. We use multiple sub-agents of the M-A3C algorithm to train multiple virtual robots independently and simultaneously in a physical simulation platform. We then transfer the trained model to the walking control of the actual robot to reduce the number of training runs on the actual robot, improve training speed, and ensure acquisition of the final gait. Finally, a biped robot is designed and fabricated to verify the effectiveness of the proposed method. Various experiments show that the proposed method can achieve the biped robot's continuous and stable gait planning.
Robotic locomotion research typically draws from biologically inspired leg designs, yet many human-engineered settings can benefit from non-anthropomorphic forms. TARS3D translates the block-shaped 'TARS' robot from Interstellar into a 0.25 m, 0.99 kg research platform with seven actuated degrees of freedom. The film shows two primary gaits: a bipedal-like walk and a high-speed rolling mode. For TARS3D, we build reduced-order models for each, derive closed-form limit-cycle conditions, and validate the predictions on hardware. Experiments confirm that the robot respects its $\pm 150^{\circ}$ hip limits, alternates left-right contacts without interference, and maintains an eight-step hybrid limit cycle in rolling mode. Because each telescopic leg provides four contact corners, the rolling gait is modeled as an eight-spoke double rimless wheel. The robot's telescopic leg redundancy implies a far richer gait repertoire than the two limit cycles treated analytically, so we used deep reinforcement learning (DRL) in simulation to search the unexplored space. We observed that the learned policy can recover the analytic gaits under the right priors and discover novel behaviors as well. Our findings show that TARS3D's fiction-inspired, biology-transcending morphology can realize multiple previously unexplored locomotion modes, and that further learning-driven search is likely to reveal more. This combination of analytic synthesis and reinforcement learning opens a promising pathway for multimodal robotics.
Accurate and precise terrain estimation is a difficult problem for robot locomotion in real-world environments. Thus, it is useful to have systems that do not depend on accurate estimation to the point of fragility. In this paper, we explore the limits of such an approach by investigating the problem of traversing stair-like terrain without any external perception or terrain models on a bipedal robot. For such blind bipedal platforms, the problem appears difficult (even for humans) due to the surprise elevation changes. Our main contribution is to show that sim-to-real reinforcement learning (RL) can achieve robust locomotion over stair-like terrain on the bipedal robot Cassie using only proprioceptive feedback. Importantly, this only requires modifying an existing flat-terrain training RL framework to include stair-like terrain randomization, without any changes in reward function. To our knowledge, this is the first controller for a bipedal, human-scale robot capable of reliably traversing a variety of real-world stairs and other stair-like disturbances using only proprioception.
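The key modification, stair-like terrain randomization added to an otherwise unchanged flat-terrain training setup, can be sketched as follows (the riser and tread ranges are illustrative, not the paper's exact values):

```python
import numpy as np

def random_stair_heightfield(length=10.0, dx=0.05, rng=None):
    """Generate a 1D stair-like height profile for terrain randomization:
    random riser heights (up or down) and random tread depths."""
    rng = rng or np.random.default_rng()
    xs = np.arange(0.0, length, dx)
    heights = np.zeros_like(xs)
    z, next_edge = 0.0, 0.0
    for i, x in enumerate(xs):
        if x >= next_edge:                           # start a new step
            z += rng.uniform(-0.17, 0.17)            # riser height [m]
            next_edge = x + rng.uniform(0.25, 0.40)  # tread depth [m]
        heights[i] = z
    return xs, heights
```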
We propose a reinforcement learning (RL) based hierarchical control framework for path tracking of a wheeled bipedal robot. The framework consists of three control levels. 1) The high-level RL is used to obtain an optimal policy through trial and error in a simulated environment. 2) The middle-level Lyapunov-based non-linear controller is utilized to track a desired line with strong robustness and high stability. 3) The low-level PID-based controller is implemented to simultaneously achieve both balancing and velocity tracking for a physical wheeled bipedal robot in real world. Thanks to the middle-level controller, the offline trained policy in simulation can be directly employed on the physical robot in real time without tuning any parameters. Moreover, the high-level policy network is able to improve optimality and generality for the task of path tracking, as well to avoid the cumbersome process of manually tuning control gains. The experiment results in both simulation and real world demonstrate that the proposed hierarchical control framework can achieve quick, robust, and stable path tracking for a wheeled bipedal robot.
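A minimal sketch of the lowest level for a wheeled bipedal robot, combining PID-based balancing and velocity tracking into a single wheel-torque command (the gains and the summing structure are illustrative assumptions, not the paper's exact controller):

```python
class PID:
    """Textbook PID controller with a fixed timestep."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, err):
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

def wheel_torque(pitch, pitch_des, v, v_des, balance_pid, speed_pid):
    # Balancing (pitch regulation) and velocity tracking are combined
    # into one wheel torque; gains would be tuned on the physical robot.
    return balance_pid.step(pitch_des - pitch) + speed_pid.step(v_des - v)
```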
Bipedal robots have achieved remarkable locomotion capabilities through reinforcement learning (RL), yet their real-world deployment remains hindered by the sim-to-real gap—dynamics mismatches between simulation and reality that degrade locomotion performance through behavioral deviations. This work introduces an online behavior adaptation framework that bridges this gap at the behavioral level by dynamically aligning emergent locomotion strategies with simulation-derived objectives. Our method integrates two core innovations: (1) a structured latent space constructed via an augmented Variational Autoencoder (VAE), which quantifies behavioral divergence through domain-invariant representations of locomotion patterns, and (2) a closed-loop adaptation module that maps latent-space deviations to real-time adjustments in low-level controller parameters. By reformulating sim-to-real transfer as a problem of behavioral alignment rather than explicit dynamics matching, the framework enables continuous adaptation to unmodeled dynamics mismatch without requiring system identification or offline retraining. Extensive experimental evaluations demonstrate the effectiveness of the proposed method, highlighting its potential to bridge the behavior gap between simulation and reality. Note to Practitioners—This paper is motivated by the problem that legged robot locomotion with a learning-based controller can suffer a performance drop due to the sim-to-real gap, which is directly reflected in the behavioral deviation between the simulated and real-world robot. Traditional methods require heavy hand-crafted parameter tuning. This paper proposes a novel online behavior adaptation framework to alleviate the sim-to-real gap, using a trained robot-behavior encoding network and a behavior adaptation network. The framework enables the robot to detect behavioral deviations and adjust low-level control parameters automatically. Simulated and real-world experiments suggest that the proposed framework can reduce the behavioral deviation between the real-world and simulated robot under unmodeled dynamics mismatch, self-correcting locomotion strategies in real time when faced with unexpected disturbances while reducing manual tuning effort. It has not yet been validated on complex terrains, which we will address in future research.
Reinforcement learning (RL) offers a promising solution for controlling humanoid robots, particularly for bipedal locomotion, by learning adaptive and flexible control strategies. However, direct RL application is hindered by time-consuming trial-and-error processes, necessitating training in simulation before real-world transfer. This introduces a reality gap that degrades performance. Although various methods have been proposed for sim-to-real transfer, they have not been validated on a consistent hardware platform, making it difficult to determine which components are key to overcoming the reality gap. In contrast, we systematically evaluate techniques to enhance RL policy robustness during sim-to-real transfer by controlling variables and comparing them on a single robot to isolate and analyze the impact of each technique. These techniques include dynamics randomization, state history usage, noise/bias/delay modeling, state selection, perturbations, and network size. We quantitatively assess the reality gap by simulating diverse conditions and conducting experiments on real hardware. Our findings provide insights into bridging the reality gap, advancing robust RL-trained humanoid robots for real-world applications.
Recent advances in both legged robot locomotion and Reinforcement Learning have shown a promising path for developing bipedal robot controllers. However, the difference in dynamics between the real world and simulation, also known as the reality gap, still hinders their use. In this paper, we focus on the sim-to-real bipedal locomotion task. We leverage recent advances in auto-tuned sim-to-real transfer and apply them to the sim-to-real bipedal locomotion problem. Similar to existing work, we first train a parameter-searching model on a dataset collected from the simulator and use real-world data to tune the simulation parameters. However, the predicted tuning can be unreliable if the training dataset distribution fails to cover the real-world data. We address this by formulating it as an out-of-distribution problem and extending the current framework with a dataset verification model. With this extended module, our method can tune the simulation parameters safely and efficiently. We demonstrate that our method outperforms existing work and achieves sim-to-real bipedal locomotion on the bipedal robot BITeno.
We study the problem of realizing the full spectrum of bipedal locomotion on a real robot with sim-to-real reinforcement learning (RL). A key challenge of learning legged locomotion is describing different gaits, via reward functions, in a way that is intuitive for the designer and specific enough to reliably learn the gait across different initial random seeds or hyperparameters. A common approach is to use reference motions (e.g. trajectories of joint positions) to guide learning. However, finding high-quality reference motions can be difficult and the trajectories themselves narrowly constrain the space of learned motion. At the other extreme, reference-free reward functions are often underspecified (e.g. move forward) leading to massive variance in policy behavior, or are the product of significant reward-shaping via trial-and-error, making them exclusive to specific gaits. In this work, we propose a reward-specification framework based on composing simple probabilistic periodic costs on basic forces and velocities. We instantiate this framework to define a parametric reward function with intuitive settings for all common bipedal gaits - standing, walking, hopping, running, and skipping. Using this function we demonstrate successful sim-to-real transfer of the learned gaits to the bipedal robot Cassie, as well as a generic policy that can transition between all of the two-beat gaits.
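The composition of simple periodic costs can be sketched as below. This is a simplified deterministic stand-in (smooth sigmoid phase indicators instead of the paper's probabilistic formulation), meant only to convey the structure of the reward:

```python
import numpy as np

def phase_indicator(phi, start, end, k=50.0):
    """Smooth indicator that gait phase phi (in [0, 1)) lies in [start, end);
    a deterministic stand-in for the paper's probabilistic periodic costs."""
    def sig(x):
        return 1.0 / (1.0 + np.exp(-k * x))
    phi = phi % 1.0
    return sig(phi - start) * sig(end - phi)

def foot_reward(phi, grf, foot_speed, swing=(0.0, 0.5)):
    """One foot's contribution: penalize ground reaction force during swing
    and foot speed during stance. Shifting `swing` per foot and changing its
    width parameterizes standing, walking, hopping, running, and skipping."""
    c_swing = phase_indicator(phi, *swing)
    return -(c_swing * abs(grf) + (1.0 - c_swing) * abs(foot_speed))
```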
In this work we propose a learning-based approach to box loco-manipulation for a humanoid robot. This is a particularly challenging problem due to the need for whole-body coordination in order to lift boxes of varying weight, position, and orientation while maintaining balance. To address this challenge, we present a sim-to-real reinforcement learning approach for training general box pickup and carrying skills for the bipedal robot Digit. Our reward functions are designed to produce the desired interactions with the box while also valuing balance and gait quality. We combine the learned skills into a full system for box loco-manipulation to achieve the task of moving boxes from one table to another with a variety of sizes, weights, and initial configurations. In addition to quantitative simulation results, we demonstrate successful sim-to-real transfer on the humanoid robot Digit. To our knowledge this is the first demonstration of a learned controller for such a task on real world hardware.
Bipedal locomotion is a key challenge in robotics, particularly for robots like Bolt, which have a point-foot design. This study explores the control of such underactuated robots using constrained reinforcement learning, addressing their inherent instability, lack of arms, and limited foot actuation. We present a methodology that leverages Constraints-as-Terminations and domain randomization techniques to enable sim-to-real transfer. Through a series of qualitative and quantitative experiments, we evaluate our approach in terms of balance maintenance, velocity control, and responses to slip and push disturbances. Additionally, we analyze autonomy through metrics like the cost of transport and ground reaction force. Our method advances robust control strategies for point-foot bipedal robots, offering insights into broader locomotion. Videos and code are available at https://gepetto.github.io/BoltLocomotion/.
In this paper, a hierarchical and robust framework for learning bipedal locomotion is presented and successfully implemented on the 3D biped robot Digit built by Agility Robotics. We propose a cascade-structure controller that combines the learning process with intuitive feedback regulations. This design allows the framework to realize robust and stable walking with a reduced-dimensional state and action spaces of the policy, significantly simplifying the design and increasing the sampling efficiency of the learning method. The inclusion of feedback regulation into the framework improves the robustness of the learned walking gait and ensures the success of the sim-to-real transfer of the proposed controller with minimal tuning. We specifically present a learning pipeline that considers hardware-feasible initial poses of the robot within the learning process to ensure the initial state of the learning is replicated as close as possible to the initial state of the robot in hardware experiments. Finally, we demonstrate the feasibility of our method by successfully transferring the learned policy in simulation to the Digit robot hardware, realizing sustained walking gaits under external force disturbances and challenging terrains not incurred during the training process. To the best of our knowledge, this is the first time a learning-based policy is transferred successfully to the Digit robot in hardware experiments.
Sim-to-real is a mainstream method to cope with the large number of trials needed by typical deep reinforcement learning methods. However, transferring a policy trained in simulation to actual hardware remains an open challenge due to the reality gap. In particular, the characteristics of actuators in legged robots have a considerable influence on sim-to-real transfer. There are two challenges: 1) High reduction ratio gears are widely used in actuators, and the reality gap issue becomes especially pronounced when backdrivability is considered in controlling joints compliantly. 2) The difficulty in achieving stable bipedal locomotion causes typical system identification methods to fail to sufficiently transfer the policy. For these two challenges, we propose 1) a new simulation model of gears and 2) a method for system identification that can utilize failed attempts. The method's effectiveness is verified using a biped robot, the ROBOTIS-OP3, and the sim-to-real transferred policy can stabilize the robot under severe disturbances and walk on uneven surfaces without using force and torque sensors.
Sim-to-real reinforcement learning (RL) for humanoid robots with high-gear ratio actuators remains challenging due to complex actuator dynamics and the absence of torque sensors. To address this, we propose a novel RL framework leveraging foot-mounted inertial measurement units (IMUs). Instead of pursuing detailed actuator modeling and system identification, we utilize foot-mounted IMU measurements to enhance rapid stabilization capabilities over challenging terrains. Additionally, we propose symmetric data augmentation dedicated to the proposed observation space and random network distillation to enhance bipedal locomotion learning over rough terrain. We validate our approach through hardware experiments on a miniature-sized humanoid EVAL-03 over a variety of environments. The experimental results demonstrate that our method improves rapid stabilization capabilities over non-rigid surfaces and sudden environmental transitions.
No abstract available
Developing robust walking controllers for bipedal robots is a challenging endeavor. Traditional model-based locomotion controllers require simplifying assumptions and careful modelling; any small errors can result in unstable control. To address these challenges for bipedal locomotion, we present a model-free reinforcement learning framework for training robust locomotion policies in simulation, which can then be transferred to a real bipedal Cassie robot. To facilitate sim-to-real transfer, domain randomization is used to encourage the policies to learn behaviors that are robust across variations in system dynamics. The learned policies enable Cassie to perform a set of diverse and dynamic behaviors, while also being more robust than traditional controllers and prior learning-based methods that use residual control. We demonstrate this on versatile walking behaviors such as tracking a target walking velocity, walking height, and turning yaw. (Video1)
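A generic domain-randomization training loop of the kind this framework relies on might look like the following sketch (the parameter ranges and the `env.set_dynamics`/`env.collect` hooks are hypothetical, not the authors' API):

```python
import numpy as np

# Illustrative randomization ranges; each episode samples new dynamics so
# the policy learns behaviors robust across the variation.
RANGES = {
    "ground_friction": (0.5, 1.2),
    "link_mass_scale": (0.85, 1.15),
    "joint_damping_scale": (0.8, 1.2),
    "actuator_delay_ms": (0.0, 20.0),
}

def sample_dynamics(rng):
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in RANGES.items()}

def train(env, policy, episodes, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(episodes):
        env.set_dynamics(sample_dynamics(rng))  # hypothetical simulator hook
        rollout = env.collect(policy)           # gather one episode
        policy.update(rollout)                  # any on/off-policy update
```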
This paper presents a reinforcement learning framework for navigation-free point-to-point walking in bipedal robots. Unlike traditional velocity-command-based approaches, we introduce displacement-based commands that align more naturally with the discrete stepping nature of legged locomotion. Our method enables smooth transitions between standing and walking, accurate control over the number of steps, and support for both discrete and continuous walking trajectories. We design a phase-encoded command format and train the policy using a history of proprioceptive states and well-crafted reward functions. The policy is trained with PPO in Isaac Lab and deployed on a full-size humanoid robot (1.8 meters tall, 80 kg, 6 DoF per leg). In simulation, the controller achieves an average tracking error of less than 0.11 meters in position and 0.05 radians in heading. On hardware, the maximum errors reach 0.3 meters and 0.2 radians due to sim-to-real discrepancies. The proposed framework demonstrates robust and efficient point-to-point locomotion without the need for high-level navigation modules.
In this work, we propose a method to generate reduced-order model reference trajectories for general classes of highly dynamic maneuvers for bipedal robots for use in sim-to-real reinforcement learning. Our approach is to utilize a single rigid-body model (SRBM) to optimize libraries of trajectories offline to be used as expert references that guide learning by regularizing behaviors when incorporated in the reward function of a learned policy. This method translates the model's dynamically rich rotational and translational behavior to a full-order robot model and successfully transfers to real hardware. The SRBM's simplicity allows for fast iteration and refinement of behaviors, while the robustness of learning-based controllers allows for highly dynamic motions to be transferred to hardware. Within this work we introduce a set of transferability constraints that amend the SRBM dynamics to actual bipedal robot hardware, our framework for creating optimal trajectories for a variety of highly dynamic maneuvers as well as our approach to integrating reference trajectories for a high-speed running reinforcement learning policy. We validate our methods on the bipedal robot Cassie on which we were successfully able to demonstrate highly dynamic grounded running gaits up to 3.0 m/s.
Reinforcement learning (RL) for bipedal locomotion has recently demonstrated robust gaits over moderate terrains using only proprioceptive sensing. However, such blind controllers will fail in environments where robots must anticipate and adapt to local terrain, which requires visual perception. In this paper, we propose a fully-learned system that allows bipedal robots to react to local terrain while maintaining commanded travel speed and direction. Our approach first trains a controller in simulation using a heightmap expressed in the robot’s local frame. Next, data is collected in simulation to train a heightmap predictor, whose input is the history of depth images and robot states. We demonstrate that with appropriate domain randomization, this approach allows for successful sim-to-real transfer with no explicit pose estimation and no fine-tuning using real-world data. To the best of our knowledge, this is the first example of sim-to-real learning for vision-based bipedal locomotion over challenging terrains.
Training and deploying reinforcement learning (RL) policies for robots is a complex task, requiring careful design of reward functions, sim-to-real transfer, and performance evaluation across various robot configurations. These tasks traditionally demand significant human expertise and effort. To address these challenges, this paper introduces Anybipe, a novel, fully automated, end-to-end framework for training and deploying bipedal robots, leveraging large language models (LLMs) for reward function generation, while supervising model training, evaluation, and deployment. The framework integrates comprehensive quantitative metrics to assess policy performance, deployment effectiveness, and safety. Additionally, it allows users to incorporate prior knowledge and preferences, improving the accuracy and alignment of generated policies with expectations. We demonstrate how Anybipe reduces human labor while maintaining high levels of accuracy and safety, examined on three different bipedal robots, showcasing its potential for autonomous RL training and deployment.
Previous humanoid robot research treats the robot as a bipedal mobile manipulation platform, where only the feet and hands contact the environment. However, humans use all body parts to interact with the world; e.g., we sit in chairs, get up from the ground, or roll on the floor. Contacting the environment using body parts other than feet and hands brings significant challenges for both model-predictive control and reinforcement learning-based methods. An unpredictable contact sequence makes it almost impossible for model-predictive control to plan ahead in real time, while the success of zero-shot sim-to-real reinforcement learning for humanoids depends heavily on GPU-accelerated rigid-body simulation and simplified collision detection. Because prior humanoid research lacks extreme torso movement, all other components become non-trivial to design, such as termination conditions, motion commands, and reward design. To address these challenges, we propose a general humanoid motion framework that takes discrete motion commands and controls the robot's motor actions in real time. Using a GPU-accelerated rigid-body simulator, we train a humanoid whole-body control policy that follows high-level motion commands in the real world in real time, even with stochastic contacts, extremely large robot-base rotations, and not-so-feasible motion commands. More details at https://project-instinct.github.io
In real-world systems, reinforcement learning policies often exhibit a lack of action smoothness, making it difficult to ensure safe and reliable control. While several methods have been proposed to improve action smoothness, they often suffer from high computational cost and limited adaptability to complex dynamics. We propose Self-Distilled Conditioning for Action Policy Smoothing (SCAPS), a memory-efficient and adaptive framework that promotes smooth actions through self-distillation by aligning the current policy with an exponential moving average of its past behaviors in a scalable manner. We evaluate the proposed approach on a bipedal wheeled robot in Isaac Sim and perform sim-to-sim transfer to MuJoCo to assess generalization. Compared to CAPS and vanilla PPO, SCAPS achieves smoother actions and lower memory usage, confirming its effectiveness in simulated control environments.
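As we read it, the core of the self-distillation idea is an auxiliary loss aligning the current policy with an exponential moving average of its own past weights. A minimal PyTorch sketch of that mechanism (our reconstruction, not the authors' code; the decay and weight values are assumptions):

```python
import torch

def ema_update(ema_policy, policy, decay=0.999):
    """Track an exponential moving average of the policy weights."""
    with torch.no_grad():
        for p_ema, p in zip(ema_policy.parameters(), policy.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

def smoothing_loss(policy, ema_policy, obs, weight=0.1):
    """Self-distillation term added to the RL loss: pull current actions
    toward the EMA policy's actions to damp high-frequency action chatter."""
    with torch.no_grad():
        target = ema_policy(obs)           # actions of the averaged policy
    return weight * torch.mean((policy(obs) - target) ** 2)
```

Because the EMA network is the only extra state, this stays memory-light compared with methods that store and compare past action trajectories.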
For legged robots to match the athletic capabilities of humans and animals, they must not only produce robust periodic walking and running, but also seamlessly switch between nominal locomotion gaits and more specialized transient maneuvers. Despite recent advancements in controls of bipedal robots, there has been little focus on producing highly dynamic behaviors. Recent work utilizing reinforcement learning to produce policies for control of legged robots have demonstrated success in producing robust walking behaviors. However, these learned policies have difficulty expressing a multitude of different behaviors on a single network. Inspired by conventional optimization-based control techniques for legged robots, this work applies a recurrent policy to execute four-step, 90° turns trained using reference data generated from optimized single rigid body model trajectories. We present a training framework using epilogue terminal rewards for learning specific behaviors from pre-computed trajectory data and demonstrate a successful transfer to hardware on the bipedal robot Cassie.
No abstract available
Loco-manipulation, the physical interaction with various objects coordinated concurrently with locomotion, remains a major challenge for legged robots due to the need for both precise end-effector control and robustness to unmodeled dynamics. While model-based controllers provide precise planning via online optimization, they are limited by model inaccuracies. In contrast, learning-based methods offer robustness but struggle with precise modulation of interaction forces. We introduce Rambo, a hybrid framework that integrates model-based whole-body control within a feedback policy trained with reinforcement learning. The model-based module generates feedforward torques by solving a quadratic program, while the policy provides feedback corrective terms to enhance robustness. We validate our framework on a quadruped robot across a diverse set of real-world loco-manipulation tasks, such as pushing a shopping cart, balancing a plate, and holding soft objects, in both quadrupedal and bipedal walking. Our experiments demonstrate that Rambo enables precise manipulation capabilities while achieving robust and dynamic locomotion.
The wheel-bipedal robot combines the advantages of wheeled robots and legged robots, but in exchange it is more challenging to perform flexible movements in various surroundings while keeping it balanced. The robot's inaccurate dynamics make the balance problem even more intractable. To solve this problem, the robot Ollie is used as a testbed. A whole-body control (WBC) framework is adopted to enhance the dexterity of the robot, which has multiple degrees of freedom in the task space. Moreover, a learning-based adaptive technique assists the WBC so that the balance controller can be designed in the absence of accurate dynamics. Physical experiments demonstrate that the robot can manage various actions with the help of the combination of WBC and the learning-based adaptive technique.
Coordinated human movement depends on the integration of multisensory inputs, sensorimotor transformation, and motor execution, as well as sensory feedback resulting from body-environment interaction. Building dynamic models of the sensory-musculoskeletal system is essential for understanding movement control and investigating human behaviours. Here, we report a human sensory-musculoskeletal model, termed SMS-Human, that integrates precise anatomical representations of bones, joints, and muscle-tendon units with multimodal sensory inputs involving visual, vestibular, proprioceptive, and tactile components. A stage-wise hierarchical deep reinforcement learning framework was developed to address the inherent challenges of high-dimensional control in musculoskeletal systems with integrated multisensory information. Using this framework, we demonstrated the simulation of three representative movement tasks, including bipedal locomotion, vision-guided object manipulation, and human-machine interaction during bicycling. Our results showed a close resemblance between natural and simulated human motor behaviours. The simulation also revealed musculoskeletal dynamics that could not be directly measured. This work offers deeper insight into the sensorimotor dynamics of human movements, facilitates quantitative understanding of human behaviours in interactive contexts, and informs the design of systems with embodied intelligence.
Maintaining stability in bipedal walking remains a significant challenge in humanoid robotics, largely due to the numerous hyperparameters involved. Traditional methods for determining these hyperparameters, such as heuristic approaches, can be both time-consuming and potentially suboptimal. In this paper, we present an approach aimed at enhancing the stability of bipedal gait, particularly when faced with floor perturbations and speed variations. Our main contribution is the integration of intrinsically stable model predictive control (IS-MPC) and whole-body admittance control within a closed-loop reinforcement learning system. We devised a reinforcement learning plugin, implemented in the mc_rtc framework, that allows the control system to continuously monitor the robot's current states, maintain recursive feasibility, and optimize parameters in real time. Furthermore, we propose a reward function derived from a combination of changes in single- and double-support time, postural recovery, the divergent component of motion, and action generation grounded in training optimization. In the course of this research, we conducted experiments on a real humanoid robot to validate initial aspects of our work. The integrated module's effectiveness was further assessed through comprehensive simulations.
Whole-body control for humanoids is challenging due to the high-dimensional nature of the problem, coupled with the inherent instability of a bipedal morphology. Learning from visual observations further exacerbates this difficulty. In this work, we explore highly data-driven approaches to visual whole-body humanoid control based on reinforcement learning, without any simplifying assumptions, reward design, or skill primitives. Specifically, we propose a hierarchical world model in which a high-level agent generates commands based on visual observations for a low-level agent to execute, both of which are trained with rewards. Our approach produces highly performant control policies in 8 tasks with a simulated 56-DoF humanoid, while synthesizing motions that are broadly preferred by humans.
This paper presents a comprehensive benchmarking framework for multi-skill bipedal locomotion learning using deep reinforcement learning with progressive waypoint-based reward shaping. We introduce the Skills–Algorithms–Rewards (SAR) matrix methodology for systematic evaluation of three actor-critic algorithms (DDPG, TD3, SAC) across five locomotion tasks in a 6-DOF underactuated bipedal robot simulation (Biped-5). Our progressive reward shaping strategy transitions from sparse (2 points) to dense (4+ points) waypoint configurations, enabling quantitative analysis of reward density effects on learning performance. Experimental results reveal distinct algorithmic superiority: SAC excels in stability-critical tasks, achieving 2% fall rates and superior energy efficiency (7.3J), while TD3 dominates dynamic locomotion with 4% fall rates and optimal cost of transport (1.21). SAC demonstrates robust waypoint navigation with 94% success rates and minimal deviation (0.21m), maintaining 84% generalization in complex scenarios. DDPG consistently underperforms across all tasks with 24-94% fall rates due to exploration limitations. Learning curves show continued improvement potential beyond 10M training steps. The Biped-5 benchmark suite establishes task-specific algorithmic guidelines and provides a standardized evaluation platform for advancing bipedal locomotion research.
This study proposes a reinforcement learning framework for dynamic balance control in underactuated bipedal robots. Leveraging Gazebo simulations and a hybrid reward function, the framework integrates proprioceptive sensing and PD-controlled joint actions to enable stable locomotion. A decaying noise strategy balances exploration and exploitation, guiding the robot from instability to reliable target navigation. Training over 100,000 iterations demonstrated policy convergence, with improved rewards and task completion times.
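The decaying-noise strategy mentioned above can be sketched as follows (schedule constants are illustrative, not the paper's values):

```python
import numpy as np

class DecayingGaussianNoise:
    """Exploration noise whose scale decays over training, shifting the
    agent from broad exploration toward exploitation of the learned policy."""
    def __init__(self, sigma0=0.3, sigma_min=0.02, decay=0.9999):
        self.sigma, self.sigma_min, self.decay = sigma0, sigma_min, decay

    def __call__(self, action, rng):
        noisy = action + rng.normal(0.0, self.sigma, size=action.shape)
        self.sigma = max(self.sigma_min, self.sigma * self.decay)
        return np.clip(noisy, -1.0, 1.0)   # keep within normalized bounds
```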
A novel push recovery control for biped robots based on the LIP model using reinforcement learning (RL) is proposed. This research delves into the application of machine learning to enhance the control and adaptability of biped robots in uncertain environments. Recovering the balance of standing against external disturbances is the main goal of this paper. Thus, we propose the Deep Deterministic Policy Gradient (DDPG) as a powerful alternative to traditional methods for push recovery control. We use the Simscape toolbox of MATLAB to develop a 7-link multi-body model and train it via the DDPG algorithm to recover its balance against a range of push intensities. We evaluate the success of this control strategy via simulations. The results show a significant improvement in the balance recovery capability of our method.
Current research on biped robots mostly assumes ideal conditions, while in reality the robot system is subject to many external disturbances and internal parameter perturbations. These lead to defects such as weak disturbance rejection, poor real-time performance, low controller trajectory-tracking accuracy, and weak balance control; solving the standing-posture balance problem is a prerequisite for stable walking. Combining the deep deterministic policy gradient (DDPG) algorithm with an extended state observer (ESO), a control strategy for disturbance-rejecting upright balance of a biped robot is designed, which quickly adjusts the pitch angle of the robot's hip joint to restore balance. The DDPG controller adjusts the joint angle in real time based on the difference between the actual zero moment point (ZMP) and its expected value to suppress disturbances; the ESO estimates and compensates for unmodeled dynamics and internal parameter perturbations, outputting a joint-angle compensation that improves control accuracy and response speed. To verify the effectiveness of the ESO-DDPG controller, it is applied to disturbance-suppression experiments on a stationary NAO robot. When the biped robot is subjected to external disturbances and internal parameter perturbations, it quickly restores its standing balance in real time.
No abstract available
As technology rapidly evolves, the application of bipedal robots in various environments has widely expanded. These robots, compared to their wheeled counterparts, exhibit a greater degree of freedom and a higher complexity in control, making the challenge of maintaining balance and stability under changing wind speeds particularly intricate. Overcoming this challenge is critical as it enables bipedal robots to sustain more stable gaits during outdoor tasks, thereby increasing safety and enhancing operational efficiency in outdoor settings. To transcend the constraints of existing methodologies, this research introduces an adaptive bio-inspired exploration framework for bipedal robots facing wind disturbances, which is based on the Deep Deterministic Policy Gradient (DDPG) approach. This framework allows the robots to perceive their bodily states through wind force inputs and adaptively modify their exploration coefficients. Additionally, to address the convergence challenges posed by sparse rewards, this study incorporates Hindsight Experience Replay (HER) and a reward-reshaping strategy to provide safer and more effective training guidance for the agents. Simulation outcomes reveal that robots utilizing this advanced method can more swiftly explore behaviors that contribute to stability in complex conditions, and demonstrate improvements in training speed and walking distance over traditional DDPG algorithms.
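The HER component can be sketched in miniature as below (our illustration; the transition layout and `reward_fn` are assumptions, not the paper's implementation):

```python
import random

def her_relabel(episode, reward_fn, k=4, rng=None):
    """Hindsight Experience Replay in miniature. Each transition is a tuple
    (obs, action, achieved, next_achieved); we relabel it with k goals that
    were actually reached later in the episode, so sparse-reward failures
    become informative successes for the replay buffer."""
    rng = rng or random.Random(0)
    relabeled = []
    for t, (obs, act, achieved, next_achieved) in enumerate(episode):
        for _ in range(k):
            goal = rng.choice(episode[t:])[3]     # a later achieved state
            reward = reward_fn(next_achieved, goal)
            relabeled.append((obs, act, reward, goal))
    return relabeled
```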
In this paper a 6 DOF biped robot model is designed, and deep reinforcement machine learning methods are implemented for the robot to learn efficient walking following a straight line. Detailed procedure of the robot design, development of a Simulink model and implementation of learning procedures is presented. Two approaches were compared for motion learning – Deep Deterministic Policy Gradient (DDPG), and Twin-Delayed Deep Deterministic Policy Gradient (TD3). The results show that both approaches are successful in generating a model with free continuous action learning and input to action mapping. Additionally, our results show that the TD3 algorithm outperforms the DDPG algorithm in the problem as formulated in this study.
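For readers comparing the two algorithms, the following sketch isolates what TD3 changes relative to DDPG in the critic target (a generic PyTorch illustration, not the Simulink setup used in the paper):

```python
import torch

def td3_critic_target(q1_t, q2_t, actor_t, next_obs, reward, done,
                      gamma=0.99, noise_std=0.2, noise_clip=0.5):
    """TD3's critic target, showing its two key changes over DDPG:
    (1) clipped Gaussian smoothing noise on the target action and
    (2) the min over twin target critics to curb Q overestimation."""
    with torch.no_grad():
        a_next = actor_t(next_obs)
        noise = (torch.randn_like(a_next) * noise_std).clamp(-noise_clip,
                                                             noise_clip)
        a_next = (a_next + noise).clamp(-1.0, 1.0)
        q_next = torch.min(q1_t(next_obs, a_next), q2_t(next_obs, a_next))
        return reward + gamma * (1.0 - done) * q_next

# TD3's third change, delayed policy updates, amounts to updating the actor
# and target networks only once per several critic updates.
```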
We propose a hierarchical reinforcement learning framework that synergistically integrates curriculum learning, diversity-optimized experience replay, and physics-guided reward design to address the dual challenges of sample inefficiency and dynamic balancing in bipedal locomotion. Our methodology features three key innovations: 1) A phased curriculum learning strategy that progressively escalates environmental complexity from static terrain adaptation to dynamic perturbation resistance and multimodal terrain generalization; 2) A partitioned experience replay system with diversity-aware buffer management using dual-criteria prioritization (TD-error thresholds and trajectory similarity metrics); 3) Physics-informed reward engineering with smooth transition mechanisms between dense foot-trajectory tracking rewards and sparse velocity-based rewards. Experimental validation on the Joyson Bipedal Robot demonstrates 14.9% improvement in final reward metrics compared to baseline RL. The framework achieves robust locomotion under significant external perturbations and adapts to untrained terrains with 97.5% motion fidelity. Our modular architecture establishes new benchmarks for sample-efficient humanoid control while maintaining real-time responsiveness (100Hz policy execution, 1000Hz PD control), showing significant potential for deployment in unstructured environments requiring safe human-robot interaction.
This work presents a hierarchical framework for bipedal locomotion that combines a Reinforcement Learning (RL)-based high-level (HL) planner policy for the online generation of task space commands with a model-based low-level (LL) controller to track the desired task space trajectories. Different from traditional end-to-end learning approaches, our HL policy takes insights from the angular momentum-based linear inverted pendulum (ALIP) to carefully design the observation and action spaces of the Markov Decision Process (MDP). This simple yet effective design creates an insightful mapping between a low-dimensional state that effectively captures the complex dynamics of bipedal locomotion and a set of task space outputs that shape the walking gait of the robot. The HL policy is agnostic to the task space LL controller, which increases the flexibility of the design and the generalization of the framework to other bipedal robots. This hierarchical design results in a learning-based framework with improved performance, data efficiency, and robustness compared with the ALIP model-based approach and state-of-the-art learning-based frameworks for bipedal locomotion. The proposed hierarchical controller is tested on three different robots: Rabbit, a five-link underactuated planar biped; Walker2D, a seven-link fully-actuated planar biped; and Digit, a 3D humanoid robot with 20 actuated joints. The trained policy naturally learns human-like locomotion behaviors and is able to effectively track a wide range of walking speeds while preserving the robustness and stability of the walking gait even under adversarial conditions.
No abstract available
Bipedal walking is a fundamental mode of locomotion in nature and has long inspired the development of humanoid robots. Passive dynamic walking stands out among bipedal locomotion strategies for its energy efficiency and human-like gait, but suffers from low stability and sensitivity to terrain. In this paper, we propose a novel method for realizing passive walking on a physical robot by combining a kneed walker model, a variant of passive dynamic walking, with reinforcement learning. Specifically, we develop a hierarchical control framework in which high-level behaviors are planned using a simplified passive walking model, while low-level control is handled by a reinforcement learning policy trained to bridge the gap between the idealized model and real-world dynamics. The method is validated in a sim-to-sim setting using two different robot platforms in the MuJoCo simulation environment. The proposed approach is compared with other approaches, demonstrating that it significantly reduces energy consumption, by approximately 16% and 31% in cost of transport (CoT) for the two types of robots. Additionally, it generates more natural and stable motions, as shown by lower ground reaction forces.
Terrain-Adaptive Bipedal Locomotion via Reinforcement Learning with Human-Inspired Stepping Strategy
No abstract available
Bipedal robots, due to their anthropomorphic design, offer substantial potential across various applications, yet their control is hindered by the complexity of their structure. Currently, most research focuses on proprioception-based methods, which lack the capability to overcome complex terrain. While visual perception is vital for operation in human-centric environments, its integration complicates control further. Recent reinforcement learning (RL) approaches have shown promise in enhancing legged robot locomotion, particularly with proprioception-based methods. However, terrain adaptability, especially for bipedal robots, remains a significant challenge, with most research focusing on flat-terrain scenarios. In this paper, we introduce a novel mixture of experts teacher-student network RL strategy, which enhances the performance of teacher-student policies based on visual inputs through a simple yet effective approach. Our method combines terrain selection strategies with the teacher policy, resulting in superior performance compared to traditional models. Additionally, we introduce an alignment loss between the teacher and student networks, rather than enforcing strict similarity, to improve the student’s ability to navigate diverse terrains. We validate our approach experimentally on the Limx Dynamic P1 bipedal robot, demonstrating its feasibility and robustness across multiple terrain types.
The presence of sensor noise, missing states and inadequate future prediction capabilities imposes significant limitations on the locomotion performance of bipedal robots operating in unstructured terrain. Conventional methods generally depend on long-term history observations to reconstruct single-frame privileged information. However, these methods fail to acknowledge the pivotal function of short-term history in rapid state responses and the significance of future state prediction in anticipating potential risks. The proposed framework is a Long–Short World Model (LSWM), which integrates state reconstruction and future state prediction to enhance the locomotion capabilities of bipedal robots in complex environments. The LSWM framework comprises two modules: a state reconstruction module (SRM) and a future state prediction module (SPM). The state reconstruction module employs long-term history observations to reconstruct privileged information in the current short-term history, thereby effectively improving the system’s robustness to sensor noise and enhancing state observability. The future state prediction module enhances the robot’s adaptability to complex environments and unpredictable scenarios by predicting the robot’s future short-term privileged information. We conducted extensive comparative experiments in simulation as well as in a variety of real-world indoor and outdoor environments. In the indoor stair-climbing task, LSWM achieved a 94% success rate, outperforming the current state-of-the-art baseline methods by at least 34%, thereby demonstrating its substantial performance advantages in complex and dynamic environments.
Achieving stable and robust perceptive locomotion for bipedal robots in unstructured outdoor environments remains a critical challenge due to complex terrain geometry and susceptibility to external disturbances. In this work, we propose a novel reward design inspired by the Linear Inverted Pendulum Model (LIPM) to enable perceptive and stable locomotion in the wild. The LIPM provides theoretical guidance for dynamic balance by regulating the center of mass (CoM) height and the torso orientation. These are key factors for terrain-aware locomotion, as they help ensure a stable viewpoint for the robot's camera. Building on this insight, we design a reward function that promotes balance and dynamic stability while encouraging accurate CoM trajectory tracking. To adaptively trade off between velocity tracking and stability, we leverage the Reward Fusion Module (RFM) approach that prioritizes stability when needed. A double-critic architecture is adopted to separately evaluate stability and locomotion objectives, improving training efficiency and robustness. We validate our approach through extensive experiments on a bipedal robot in both simulation and real-world outdoor environments. The results demonstrate superior terrain adaptability, disturbance rejection, and consistent performance across a wide range of speeds and perceptual conditions.
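The LIPM assumes the CoM moves at a constant height $z_c$, under which its dynamics reduce to $\ddot{x} = \frac{g}{z_c}(x - x_{zmp})$. A minimal sketch of a stability reward term built on that intuition follows; the weights and the exponential form are illustrative assumptions, and the paper's full reward and RFM weighting are more elaborate.

```python
import numpy as np

def lipm_stability_reward(com_height, com_height_ref, torso_rpy,
                          w_height=1.0, w_orient=0.5):
    """Sketch of a LIPM-motivated stability reward (weights are illustrative).

    Keeping the CoM near the reference height z_c and the torso level keeps
    the robot in the regime where the LIPM holds, and keeps the camera
    viewpoint stable for perception.
    """
    height_err = (com_height - com_height_ref) ** 2
    roll, pitch, _ = torso_rpy                 # yaw is left to velocity tracking
    orient_err = roll ** 2 + pitch ** 2
    return np.exp(-w_height * height_err - w_orient * orient_err)
```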
Enabling bipedal walking robots to learn how to maneuver over highly uneven, dynamically changing terrains is challenging due to the complexity of robot dynamics and of the environments they interact with. Recent advancements in learning from demonstrations have shown promising results for robot learning in complex environments. While imitation learning of expert policies has been well explored, learning expert reward functions remains largely under-explored in legged locomotion. This paper brings state-of-the-art Inverse Reinforcement Learning (IRL) techniques to bear on bipedal locomotion problems over complex terrains. We propose algorithms for learning expert reward functions, and we subsequently analyze the learned functions. Through nonlinear function approximation, we uncover meaningful insights into the expert's locomotion strategies. Furthermore, we empirically demonstrate that training a bipedal locomotion policy with the inferred reward functions enhances its walking performance on unseen terrains, highlighting the adaptability offered by reward learning.
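As a concrete reference point for reward learning, one gradient step of classic linear maximum-entropy IRL looks as follows; this is a generic sketch rather than the specific IRL technique the paper employs, and all names are illustrative.

```python
import numpy as np

def maxent_irl_step(theta, expert_features, policy_features, lr=1e-2):
    """One gradient step of linear maximum-entropy IRL (a generic sketch).

    theta: reward weights, with reward(s) = theta @ phi(s).
    expert_features / policy_features: (N, d) state features gathered from
    expert demonstrations and from rollouts of the current policy.
    The gradient matches expected features under the expert and the policy.
    """
    grad = expert_features.mean(axis=0) - policy_features.mean(axis=0)
    return theta + lr * grad
```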
This paper presents a novel reinforcement learning (RL) framework for designing cascade feedback control policies for 3D bipedal locomotion. Existing RL algorithms are often trained end-to-end or rely on prior knowledge of reference joint or task-space trajectories. Unlike these studies, we propose a policy structure that decouples the bipedal locomotion problem into two modules, incorporating physical insights from the nature of walking dynamics and the well-established Hybrid Zero Dynamics approach for 3D bipedal walking. As a result, the overall RL framework has several key advantages, including a lightweight network structure, sample efficiency, and less dependence on prior knowledge. The proposed solution learns stable and robust walking gaits from scratch and allows the controller to realize omnidirectional walking with accurate tracking of the desired velocity and heading angle. The learned policies also perform robustly against various adversarial forces applied to the torso and when walking blindly over a series of challenging, unstructured terrains. These results demonstrate that the proposed cascade feedback control policy is suitable for navigating 3D bipedal robots in indoor and outdoor environments.
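The decoupling described above can be summarized in a small interface sketch; the PD-style low level and the dictionary-based state are assumptions for illustration, not the paper's actual controller.

```python
class CascadePolicy:
    """Sketch of a two-module cascade; the structure follows the abstract,
    the interfaces are assumptions."""
    def __init__(self, planner, gains):
        self.planner = planner    # learned: (command, state) -> gait references
        self.gains = gains        # fixed low-level feedback gains

    def act(self, command, state):
        # High level: map desired velocity/heading to joint-space references,
        # in the spirit of Hybrid Zero Dynamics output trajectories.
        q_ref, dq_ref = self.planner(command, state)
        # Low level: joint-space PD feedback toward the reference.
        q, dq = state["q"], state["dq"]
        return self.gains["kp"] * (q_ref - q) + self.gains["kd"] * (dq_ref - dq)
```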
Animals such as rabbits and birds can instantly generate locomotion behavior in reaction to a dynamic, approaching object, such as a person or a rock, despite having possibly never seen the object before and having limited perception of the object's properties. Recently, deep reinforcement learning has enabled complex kinematic systems such as humanoid robots to successfully move from point A to point B. Inspired by the observation of the innate reactive behavior of animals in nature, we hope to extend this progress in robot locomotion to settings where external, dynamic objects are involved whose properties are partially observable to the robot. As a first step toward this goal, we build a simulation environment in MuJoCo where a legged robot must avoid getting hit by a ball moving toward it. We explore whether prior locomotion experiences that animals typically possess benefit the learning of a reactive control policy under a proposed hierarchical reinforcement learning framework. Preliminary results support the claim that the learning becomes more efficient using this hierarchical reinforcement learning method, even when partial observability (radius-based object visibility) is taken into account.
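The radius-based partial observability is simple to make concrete; the masking convention below (NaN entries for an unseen ball, float positions) is an illustrative assumption.

```python
import numpy as np

def radius_visibility(ball_pos, robot_pos, radius):
    """Radius-based object visibility: the ball's state enters the observation
    only once it is within `radius` of the robot; otherwise it is masked out."""
    visible = np.linalg.norm(ball_pos - robot_pos) <= radius
    return ball_pos if visible else np.full_like(ball_pos, np.nan)
```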
Effective bipedal locomotion in dynamic environments, such as cluttered indoor spaces or uneven terrain, requires agile and adaptive movement in all directions. This necessitates omnidirectional terrain sensing and a controller capable of processing such input. We present a learning framework for vision-based omnidirectional bipedal locomotion, enabling seamless movement using depth images. A key challenge is the high computational cost of rendering omnidirectional depth images in simulation, making traditional sim-to-real reinforcement learning (RL) impractical. Our method combines a robust blind controller with a teacher policy that supervises a vision-based student policy, trained on noise-augmented terrain data to avoid rendering costs during RL and ensure robustness. We also introduce a data augmentation technique for supervised student training, accelerating training by up to 10 times compared to conventional methods. Our framework is validated through simulation and real-world tests, demonstrating effective omnidirectional locomotion with minimal reliance on expensive rendering. This is, to the best of our knowledge, the first demonstration of vision-based omnidirectional bipedal locomotion, showcasing its adaptability to diverse terrains.
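A minimal sketch of the kind of noise augmentation that lets the student train without in-the-loop rendering follows; the Gaussian-plus-dropout noise model is an assumption, as the abstract does not specify the augmentation.

```python
import numpy as np

def augment_depth(depth, rng, noise_std=0.02, dropout_p=0.05):
    """Noise model applied to terrain depth inputs during student training.

    Additive Gaussian noise plus random pixel dropout stand in for sensor
    noise and missing returns, so the student never needs expensive
    omnidirectional rendering inside the RL loop.
    """
    noisy = depth + rng.normal(0.0, noise_std, size=depth.shape)
    mask = rng.random(depth.shape) < dropout_p
    noisy[mask] = 0.0   # emulate invalid depth returns
    return noisy
```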
This study presents EmoBipedNav, an emotion-aware navigation framework using deep reinforcement learning (DRL) for bipedal robots walking in socially interactive environments. The inherent locomotion constraints of bipedal robots challenge their ability to maneuver safely in dynamic environments. When combined with the intricacies of social environments, including pedestrian interactions and social cues such as emotions, these challenges become even more pronounced. To address these coupled problems, we propose a two-stage pipeline that considers both bipedal locomotion constraints and complex social environments. Specifically, social navigation scenarios are represented using sequential LiDAR grid maps (LGMs), from which we extract latent features, including collision regions, emotion-related discomfort zones, social interactions, and the spatio-temporal dynamics of evolving environments. The extracted features are directly mapped to the actions of reduced-order models (ROMs) through a DRL architecture. Furthermore, the proposed framework incorporates full-order dynamics and locomotion constraints during training, effectively accounting for tracking errors and restrictions of the locomotion controller while planning trajectories with ROMs. Comprehensive experiments demonstrate that our approach outperforms both model-based planners and DRL-based baselines. The hardware videos and open-source code are available at https://gatech-lidar.github.io/emobipednav.github.io/.
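The LGM-to-ROM mapping can be sketched as a small convolutional encoder over stacked LiDAR grid maps; the layer sizes, frame count, and three-dimensional ROM command are illustrative assumptions, assuming PyTorch.

```python
import torch.nn as nn

class LGMEncoder(nn.Module):
    """Sketch: stacked sequential LiDAR grid maps -> reduced-order-model action."""
    def __init__(self, n_frames=4, action_dim=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_frames, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, action_dim)   # e.g., a ROM step/velocity command

    def forward(self, lgm_stack):               # (B, n_frames, H, W)
        return self.head(self.conv(lgm_stack))
```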
This report synthesizes key research on deep-reinforcement-learning-based control of wheeled bipedal robots. The research trajectory starts from optimizing foundational RL algorithms (e.g., improvements to PPO and SAC) and then examines how fusing classical dynamics models (MPC/WBC) improves the physical consistency of control. For deployment in real environments, Sim-to-Real transfer and disturbance-resistant balance recovery form the core barriers of the field. Meanwhile, biological inspiration and imitation learning are driving robot gaits toward greater naturalness and diversity. With the introduction of visual perception and hierarchical planning architectures, wheeled bipedal robots show great potential for autonomous navigation and task execution over complex terrain. Finally, dedicated optimization for the hybrid wheel-leg morphology marks the field's progression toward more efficient and agile embodied intelligence.